# Representing text as tensors

To solve NLP tasks, we need to represent text as tensors. There are different approaches to representing text:
- Character-level representation, where we treat each character as a number.
- Word-level representation, where we build a vocabulary of all words in our text and then represent each word using one-hot encoding.
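
As a rough illustration of the two options above, the following sketch encodes a tiny string both ways. The sample text and all variable names are hypothetical and chosen only for this example; it is not how the Keras layers used later work internally.

```python
# Toy illustration; the sample text and names are hypothetical.
text = "hot dog"

# Character-level: map each distinct character to an integer id.
char_vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_to_id[ch] for ch in text]

# Word-level: build a vocabulary of words, then one-hot encode each word.
words = text.split()
word_vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(word_vocab)}


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


word_vectors = [one_hot(word_to_id[w], len(word_vocab)) for w in words]

print(char_ids)
print(word_vectors)
```

Note that the one-hot vectors are as long as the vocabulary, which is why vocabulary size matters so much in the sections that follow.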

## Text Classification

We'll start with a text classification task based on the AG_NEWS dataset. We'll classify news headlines into one of four categories: World, Sports, Business and Sci/Tech.

In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

In [3]:
# Load the dataset
dataset = tfds.load('ag_news_subset')




In [4]:
ds_train = dataset['train']
ds_test = dataset['test']

print(f"Length of train dataset = {len(ds_train)}")
print(f"Length of test dataset = {len(ds_test)}")

Length of train dataset = 120000
Length of test dataset = 7600


In [5]:
classes = ['World','Sports','Business','Sci/Tech']

for i,x in zip(range(5), ds_train):
    print(f"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}")

3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'
1 (Sports) -> b"Wood's Suspension Upheld (Reuters)" b'Reuters - Major League Baseball\\Monday announced a decision on the appeal filed by Chicago Cubs\\pitcher Kerry Wood regarding a suspension stemming from an\\incident earlier this season.'
2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'
3 (Sci/Tech) -> b"'Halt science decline in schools'" b'Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.'
1 (Sports) -> b'Gerrard leaves practice' b'London, England (Sports Network

### Text Vectorization

We need to convert text into numbers that can be represented as tensors. For word-level representation, we need to do two things:
- Use a tokenizer to split the text into tokens
- Build a vocabulary of those tokens

The full vocabulary of this dataset is large (more than 100k words). Generally speaking, we don't need words that rarely occur in the text, so we limit the vocabulary size by passing the `max_tokens` argument to the vectorizer constructor.

In [22]:
vocab_size = 50000
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size)
vectorizer.adapt(ds_train.take(500).map(lambda x: x['title']+' '+x['description']))

In [23]:
vocab = vectorizer.get_vocabulary()
vocab_size = len(vocab)
print(vocab[:10])
print(f"Length of vocabulary : {vocab_size}")

['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for']
Length of vocabulary : 5335
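
To make the effect of a vocabulary cap concrete, here is a minimal hand-written sketch. It only approximates what the vectorizer does (the real layer also strips punctuation, for instance), and `build_vocab`, `encode` and the toy corpus are hypothetical names introduced for this example. As in the output above, index 0 is reserved for padding and index 1 for unknown words.

```python
from collections import Counter


def build_vocab(texts, max_tokens):
    """Keep the most frequent tokens; index 0 is padding, index 1 is unknown."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common


def encode(text, vocab):
    """Map each token to its vocabulary index, or to 1 if it was dropped."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in text.lower().split()]


toy_texts = ['the dog ran', 'the dog sat', 'the cat sat', 'the dog']
toy_vocab = build_vocab(toy_texts, max_tokens=5)

print(toy_vocab)
print(encode('The cat sat', toy_vocab))
```

Rare words such as `cat` fall outside the cap and are all mapped to the same `[UNK]` index.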


### Bag-of-words text representation

In the bag-of-words (BoW) vector representation, each word is linked to a vector index, and the vector element at that index contains the number of occurrences of that word in a given document.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
sc_vectorizer = CountVectorizer()
corpus = [
    'I like hot dogs.',
    'The dog ran fast',
    'Its hot outside',
]
sc_vectorizer.fit_transform(corpus)
sc_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)

We can also use the Keras vectorizer that we defined above, converting each word number into a one-hot encoding and adding all those vectors up.

In [16]:
def to_bow(text):
    return tf.reduce_sum(tf.one_hot(vectorizer(text), vocab_size), axis=0)

to_bow('My dog likes hot dogs on a hot day.').numpy()

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
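
The same summation can be spelled out in plain Python. This is only a sketch with a hypothetical four-word vocabulary (`toy_vocab`), where index 0 stands for every unknown word.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def bag_of_words(token_ids, vocab_size):
    """Sum the one-hot vectors of all tokens, giving per-word counts."""
    bow = [0] * vocab_size
    for idx in token_ids:
        bow = [b + o for b, o in zip(bow, one_hot(idx, vocab_size))]
    return bow


toy_vocab = ['[UNK]', 'dog', 'dogs', 'hot']
index = {w: i for i, w in enumerate(toy_vocab)}
tokens = 'my dog likes hot dogs on a hot day'.split()
ids = [index.get(w, 0) for w in tokens]

bow_vector = bag_of_words(ids, len(toy_vocab))
print(bow_vector)
```

Each position of the result counts how often the corresponding vocabulary entry occurred, and word order is lost entirely.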

### Training the BoW classifier

Let's train a classifier that uses the BoW representation. First, we need to convert our dataset to a bag-of-words representation. This can be achieved using the `map` function in the following way:

In [24]:
batch_size = 128

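# Join title and description with a space so that the last word of the title
# does not merge with the first word of the description
ds_train_bow = ds_train.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)
ds_test_bow = ds_test.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)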

Now, let's define a simple classifier neural network that contains one linear layer. The input size is vocab_size, and the output size corresponds to the number of classes (4). 

In [25]:
model = keras.models.Sequential([
    keras.layers.Dense(4, activation='softmax', input_shape=(vocab_size,))
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_bow, validation_data=(ds_test_bow))



<keras.callbacks.History at 0x215bb43a7c0>

### Training a classifier as one network

Because the vectorizer is also a Keras layer, we can define a network that includes it and train it end-to-end. This way we don't need to vectorize the dataset using `map`; we can just pass the original dataset to the input of the network.

In [3]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

In [29]:
inp = keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inp)
x = tf.reduce_sum(tf.one_hot(x, vocab_size), axis=1)
out = keras.layers.Dense(4, activation='softmax')(x)
model = keras.models.Model(inp, out)
model.summary()

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), validation_data=ds_test.map(tupelize).batch(batch_size))

Model: "model_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization_4 (TextVe (None, None)              0         
_________________________________________________________________
tf.one_hot_3 (TFOpLambda)    (None, None, 5335)        0         
_________________________________________________________________
tf.math.reduce_sum_3 (TFOpLa (None, 5335)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 4)                 21344     
Total params: 21,344
Trainable params: 21,344
Non-trainable params: 0
_________________________________________________________________


<keras.callbacks.History at 0x214d18bfe80>

### Bigrams, trigrams and n-grams

One limitation of the bag-of-words approach is that some words are part of multi-word expressions. For example, the expression 'hot dog' has a completely different meaning from the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' using the same vectors, it can confuse our model.

To address this, n-gram representations are often used in document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In a bigram representation, for example, we add all pairs of adjacent words to the vocabulary, in addition to the original words.

An example of generating a bigram bag-of-words representation with scikit-learn:

In [27]:
bigram_vectorizer = CountVectorizer(ngram_range=(1,2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
bigram_vectorizer.fit_transform(corpus)
print("Vocabulary:\n",bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

Vocabulary:
 {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}


array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

The main drawback of the n-gram approach is that the vocabulary size starts to grow extremely fast. In practice, we need to combine the n-gram representation with a dimensionality reduction technique, such as embeddings.
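
The growth is visible even on the three-sentence toy corpus used above. The sketch below counts distinct n-grams for increasing `max_n`; the helper names are hypothetical, and the tokenization is a plain whitespace split rather than the scikit-learn token pattern.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_vocabulary(corpus, max_n):
    """Collect every distinct 1-gram up to max_n-gram in the corpus."""
    vocab = set()
    for doc in corpus:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return vocab


toy_corpus = ['i like hot dogs', 'the dog ran fast', 'its hot outside']
sizes = {max_n: len(ngram_vocabulary(toy_corpus, max_n)) for max_n in (1, 2, 3)}
print(sizes)
```

On a real corpus most n-grams are rare, so the vocabulary grows much faster than in this tiny example.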

### Automatically calculating BoW Vectors

In the example above, we calculated BoW vectors by hand by summing the one-hot encodings of individual words. However, recent versions of TensorFlow allow us to calculate BoW vectors automatically by passing the `output_mode='count'` parameter to the vectorizer constructor. This makes defining and training our model significantly easier.

In [30]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size, output_mode='count'),
    keras.layers.Dense(4, input_shape=(vocab_size, ), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x215bcf4cb80>

### Term Frequency - inverse document frequency (TF-IDF)

In the BoW representation, word occurrences are weighted in the same way regardless of the word itself. However, it's clear that frequent words such as *a* and *in* are much less important for classification than specialized terms. In most NLP tasks, some words are more relevant than others.

TF-IDF is a variation of bag-of-words where, instead of a binary 0/1 value indicating the appearance of a word in a document, a floating-point value is used, which is related to the frequency of the word's occurrence in the corpus.
The TF-IDF value increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others.

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
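# Use a separate name so that the Keras `vectorizer` defined earlier is not overwritten
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf_vectorizer.fit_transform(corpus)
tfidf_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()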

array([[0.43381609, 0.        , 0.43381609, 0.        , 0.65985664,
        0.43381609, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])
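
For intuition, the weighting can be computed by hand with the textbook formula tf × log(N / df). This is a simplified sketch: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ from the ones produced here. The function name `tf_idf` and the toy documents are assumptions made for this example.

```python
import math

toy_docs = [d.split() for d in
            ['i like hot dogs', 'the dog ran fast', 'its hot outside']]


def tf_idf(term, doc, docs):
    """Textbook TF-IDF: term count in doc times log(N / document frequency)."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


# 'hot' appears in two of the three documents, 'dogs' in only one,
# so 'dogs' gets the larger weight within the first document.
print(tf_idf('hot', toy_docs[0], toy_docs))
print(tf_idf('dogs', toy_docs[0], toy_docs))
```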

In Keras, the `TextVectorization` layer can compute TF-IDF weights automatically when we pass the `output_mode='tf-idf'` parameter (newer Keras releases spell this value `'tf_idf'`).

In [32]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size, output_mode='tf-idf'),
    keras.layers.Dense(4, input_shape=(vocab_size, ), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x215bb737a30>

Even though TF-IDF representations assign frequency-based weights to different words, they are unable to represent meaning or word order. Next, we will learn how to capture contextual information from text using language modeling.

# Represent words with embeddings

The idea of embedding is to represent words using lower-dimensional dense vectors that reflect the semantic meaning of the word. 
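
Conceptually, an embedding layer is a lookup table from token ids to dense vectors, and a simple classifier can average those vectors into one vector per text. The sketch below shows that idea with a hand-written table; the 2-dimensional values and the name `embed_and_average` are hypothetical, whereas a real embedding table is learned during training.

```python
# Hypothetical 2-dimensional embedding table for a 4-token vocabulary.
embedding_table = [
    [0.0, 0.0],  # id 0
    [1.0, 0.0],  # id 1
    [0.0, 1.0],  # id 2
    [1.0, 1.0],  # id 3
]


def embed_and_average(token_ids, table):
    """Look up one vector per token id and average them per dimension."""
    vectors = [table[i] for i in token_ids]
    dim = len(table[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


sentence_vector = embed_and_average([1, 2, 3, 3], embedding_table)
print(sentence_vector)
```

The output has the embedding dimension regardless of how many tokens the text contains, which is what lets a dense layer follow it.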

In [33]:
vocab_size = 30000
batch_size = 128

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size, input_shape=(1,))

model = keras.models.Sequential([
    vectorizer,
    # Embedding layer takes n numbers and reduces each number to a dense vector
    keras.layers.Embedding(vocab_size, 100),
    # Aggregation layer computes the average of all n input tensors corresponding to different words
    keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    keras.layers.Dense(4, activation="softmax")
])

model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_8 (TextVe (None, None)              0         
_________________________________________________________________
embedding (Embedding)        (None, None, 100)         3000000   
_________________________________________________________________
lambda (Lambda)              (None, 100)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 4)                 404       
Total params: 3,000,404
Trainable params: 3,000,404
Non-trainable params: 0
_________________________________________________________________


In [34]:
print("Training vectorizer")
vectorizer.adapt(ds_train.take(500).map(extract_text))

model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x215bb535940>

## Semantic embeddings: Word2Vec

In the previous example, the embedding layer learned to map words to vector representations; however, these representations did not have semantic meaning. To learn a vector representation in which similar words or synonyms correspond to vectors that are close to each other in terms of some vector distance, we need to pretrain our embedding model on a large collection of text using a technique such as Word2Vec.

Word2Vec is based on two main architectures that are used to produce a distributed representation of words:
- Continuous bag-of-words (CBoW), where we train the model to predict a word from the surrounding context
- Continuous skip-gram, which is the opposite of CBoW: the model uses the current word to predict the surrounding window of context words
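
The difference between the two architectures is easiest to see in the training examples they are built from: CBoW maps a set of context words to the center word, while skip-gram maps the center word to each context word. The sketch below only generates such pairs and does not train anything; the function names and the window size are assumptions made for this illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_pairs(tokens, window):
    """(context words -> center word) training examples."""
    return [(context_of(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_pairs(tokens, window):
    """(center word -> one context word) training examples."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_of(tokens, i, window)]


toy_tokens = 'i like hot dogs'.split()
print(cbow_pairs(toy_tokens, window=1))
print(skipgram_pairs(toy_tokens, window=1))
```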

To experiment with the Word2Vec embeddings pretrained on the Google News dataset, we can use the **gensim** library.

In [35]:
import gensim.downloader as api
w2v = api.load('word2vec-google-news-300')



In [36]:
for w,p in w2v.most_similar('neural'):
    print(f"{w} -> {p}")

neuronal -> 0.780479907989502
neurons -> 0.7326499223709106
neural_circuits -> 0.7252850532531738
neuron -> 0.7174385190010071
cortical -> 0.6941086649894714
brain_circuitry -> 0.6923245787620544
synaptic -> 0.6699119210243225
neural_circuitry -> 0.6638563275337219
neurochemical -> 0.6555314660072327
neuronal_activity -> 0.6531826853752136


With semantic embeddings, we can manipulate the vector encoding based on semantics. For example, we can ask for the word whose vector representation is as close as possible to the words *king* and *woman*, and as far as possible from the word *man*.

In [37]:
w2v.most_similar(positive=['king', 'woman'], negative=['man'])[0]

('queen', 0.7118193507194519)
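
The arithmetic behind this query can be sketched with hand-made vectors. This is a simplification of what gensim does (it works with normalized vectors and a vocabulary of millions of words), and the 2-dimensional `toy_vectors`, as well as the `analogy` helper, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    'king':   [1.0, 1.0],
    'queen':  [1.0, -1.0],
    'man':    [0.0, 1.0],
    'woman':  [0.0, -1.0],
    'prince': [1.0, 0.8],
}


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + v for q, v in zip(query, vectors[w])]
    for w in negative:
        query = [q - v for q, v in zip(query, vectors[w])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


result = analogy(['king', 'woman'], ['man'], toy_vectors)
print(result)
```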

Word2Vec seems like a great way to express word semantics, but it has several disadvantages, including the following:
- Both CBoW and skip-gram models are predictive embeddings, and they only take local context into account. Word2Vec does not take advantage of global context
- Word2Vec does not take into account word morphology, i.e., the fact that the meaning of a word can depend on different parts of the word, such as the root

**FastText** tries to overcome the second limitation. It builds on Word2Vec by learning vector representations for each word and for the character n-grams found within each word. The values of the representations are then averaged into one vector at each training step. While this adds a lot of additional computation to pretraining, it enables word embeddings to encode sub-word information.
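
To give a feel for what "character n-grams found within each word" means, the sketch below extracts them for a single word. It assumes the word is wrapped in `<` and `>` boundary markers before slicing, so that prefixes and suffixes are distinguishable; the function name and the default n-gram lengths are choices made for this example only.

```python
def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    marked = '<' + word + '>'
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


print(char_ngrams('where'))
```

Words that share a root or suffix share some of these n-grams, which is how sub-word information ends up in the word vector.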

Another method, **GloVe**, takes a different approach to word embeddings, based on the factorization of the word-context matrix. First, it builds a large matrix that counts the number of word occurrences in different contexts, and then it tries to represent this matrix in lower dimensions in a way that minimizes reconstruction loss.
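
The first step, counting how often words occur in each other's context, can be sketched as follows. Only the counting is shown, not the factorization; the window size, the function name and the toy corpus are assumptions made for this illustration.

```python
from collections import Counter


def cooccurrence(corpus, window=1):
    """Count how often each (word, context word) pair occurs within a window."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts


toy_counts = cooccurrence(['hot dogs are hot', 'hot dogs'])
print(toy_counts[('hot', 'dogs')])
```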

## Using pretrained embeddings in Keras

There are two possible options for the vocabulary: use the tokenizer vocabulary, or use the vocabulary from the Word2Vec embeddings.

### Tokenizer Vocabulary

In this method, some of the words from the vocabulary will have corresponding Word2Vec embeddings, and some will be missing. 

In [40]:
embed_size = len(w2v.get_vector('hello'))
print(f"Embedding size: {embed_size}")

vocab = vectorizer.get_vocabulary()
W = np.zeros((vocab_size, embed_size))
print('Populating matrix, this will take some time...', end='')
found, not_found = 0, 0
for i, w in enumerate(vocab):
    try:
        W[i] = w2v.get_vector(w)
        found += 1
    except KeyError:  # the word is not present in the Word2Vec vocabulary
        not_found += 1
print(f"Done, found {found} words, {not_found} words missing")

Embedding size: 300
Populating matrix, this will take some time...Done, found 4551 words, 784 words missing


For words that are not present in the Word2Vec vocabulary, we can either leave them as zeroes, or generate a random vector

In [41]:
emb = keras.layers.Embedding(vocab_size, embed_size, weights=[W], trainable=False) 
# Trainable=False will not retrain the embeddings
model = keras.models.Sequential([
    vectorizer, emb,
    keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    keras.layers.Dense(4, activation='softmax')
])

In [42]:
model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), 
          validation_data=ds_test.map(tupelize).batch(batch_size))



<keras.callbacks.History at 0x216c7d2a1f0>

### Embeddings vocabulary

One issue with the previous approach is that the vocabularies used in the `TextVectorization` and `Embedding` layers are different. To overcome this problem, we can use one of the following solutions:
- Retrain the Word2Vec model on our vocabulary
- Load our dataset with the vocabulary from the pretrained Word2Vec model. The vocabulary used to load the dataset can be specified during loading

In [59]:
vocab = list(w2v.index_to_key)
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(input_shape=(1,))
vectorizer.set_vocabulary(vocab)

In [None]:
# This cell is kept disabled: `get_keras_embedding` was part of gensim 3.x and is
# not available in gensim 4.x, which provides the `index_to_key` attribute used above.
"""
model = keras.models.Sequential([
    vectorizer,
    w2v.wv.get_keras_embedding(train_embeddings=False),
    keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    keras.layers.Dense(4, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(128),
          validation_data=ds_test.map(tupelize).batch(128), epochs=5)
"""

### Contextual Embeddings

One key limitation of traditional pretrained embedding representations such as Word2Vec is the fact that, even though they can capture some meaning of a word, they can't differentiate between different meanings. This can cause problems in downstream models.

For example, the word 'play' has different meanings in these two sentences:
- I went to a **play** at the theater.
- John wants to **play** with his friends.

The pretrained embeddings we talked about represent both meanings of the word 'play' with the same embedding. To overcome this limitation, we need to build embeddings based on a **language model**, which is trained on a large corpus of text and *knows* how words can be put together in different contexts.

# Capture patterns with recurrent neural network

## Recurrent neural network

The models we've been using so far are unable to represent word order, so they cannot solve more complex or ambiguous tasks such as text generation or question answering. To capture the meaning of a text sequence, we'll use a neural network architecture called a **recurrent neural network**, or RNN. In an RNN, we pass our sentence through the network one token at a time, and the network produces some state, which we then pass to the network again with the next token.

In [62]:
batch_size = 16
embed_size = 64

### Simple RNN classifier

In this architecture, each recurrent unit is a simple linear network, which takes in an input vector and state vector and produces a new state vector. 

> In cases where the dimensionality isn't so high, it might make sense to pass one-hot encoded tokens directly into the RNN cell  

In [63]:
vocab_size = 20000

vectorizer = keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=vocab_size,
    input_shape=(1,)
)

model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(4, activation='softmax')
])

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization_13 (TextV (None, None)              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, None, 64)          1280000   
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 16)                1296      
_________________________________________________________________
dense_11 (Dense)             (None, 4)                 68        
Total params: 1,281,364
Trainable params: 1,281,364
Non-trainable params: 0
_________________________________________________________________


RNNs in general are quite difficult to train, because once the RNN cells are unrolled along the sequence length, the resulting number of layers involved in backpropagation is quite large. Thus, we need to select a smaller learning rate and train the network on a larger dataset to produce good results.
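
To make the recurrence and the unrolling concrete, here is a minimal pure-Python sketch of a simple RNN, assuming a tanh activation (the Keras `SimpleRNN` default). The toy weights are hypothetical. The parameter-count helper reproduces the 1,296 parameters reported for the `simple_rnn` layer in the summary above (input size 64, 16 units).

```python
import math


def rnn_step(x, h, W, U, b):
    """One recurrent step: new_state = tanh(W @ x + U @ h + b)."""
    return [
        math.tanh(
            sum(W[i][j] * x[j] for j in range(len(x)))
            + sum(U[i][k] * h[k] for k in range(len(h)))
            + b[i]
        )
        for i in range(len(b))
    ]


def run_rnn(inputs, W, U, b):
    """Unroll the recurrence over a sequence and return the final state."""
    h = [0.0] * len(b)
    for x in inputs:
        h = rnn_step(x, h, W, U, b)
    return h


def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases."""
    return units * input_dim + units * units + units


# Hypothetical weights: 2-dimensional input, a single recurrent unit.
W, U, b = [[0.5, -0.5]], [[0.1]], [0.0]
final_state = run_rnn([[1.0, 0.0], [0.0, 1.0]], W, U, b)

print(final_state)
print(simple_rnn_params(64, 16))
```

The same weights `W`, `U` and `b` are reused at every step, so a sequence of length n behaves like an n-layer network with shared parameters during backpropagation.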

In [64]:
def extract_title(x):
    return x['title']

def tupelize_title(x):
    return (extract_title(x), x['label'])

In [65]:
print('Training vectorizer')
vectorizer.adapt(ds_train.take(2000).map(extract_title))

model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize_title).batch(batch_size), 
          validation_data=ds_test.map(tupelize_title).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x216c7d74880>

The `TextVectorization` layer will automatically pad sequences of variable length in a minibatch with pad tokens. Those tokens also take part in training, and they can complicate the convergence of the model.

To minimize the amount of padding, there are several approaches:
- We can reorder the dataset by sequence length and group all sequences by size. This can be done with the `tf.data.experimental.bucket_by_sequence_length` function
- We can also use masking. In Keras, some layers support additional input that shows which tokens should be taken into account when training. To incorporate masking in our model, we can either include a separate Masking layer, or we can specify the `mask_zero=True` parameter of our Embedding layer.
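
Both ideas can be illustrated without TensorFlow. The sketch below pads a batch on the right with zeros, derives the corresponding mask, and compares how many pad tokens are needed when sequences are batched in their original order versus sorted by length. All function names and the toy sequences are hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_id] * (max_len - len(s)) for s in sequences]


def padding_needed(sequences, batch_size):
    """Count pad tokens added when batching the sequences in the given order."""
    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(s) for s in batch)
        total += sum(longest - len(s) for s in batch)
    return total


toy_sequences = [[5, 3], [7, 1, 4, 9, 2], [8], [6, 2, 3, 3]]
unsorted_padding = padding_needed(toy_sequences, batch_size=2)
sorted_padding = padding_needed(sorted(toy_sequences, key=len), batch_size=2)

padded = pad_batch([[5, 3], [8]])
# True marks real tokens, False marks padding that should be ignored.
mask = [[tok != 0 for tok in seq] for seq in padded]

print(unsorted_padding, sorted_padding)
print(padded, mask)
```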

In [66]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size, mask_zero=True),
    keras.layers.SimpleRNN(16),
    keras.layers.Dense(4, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(batch_size),
             validation_data=ds_test.map(tupelize).batch(batch_size))



<keras.callbacks.History at 0x216c81814f0>

### LSTM: Long Short-Term Memory

One of the main problems of RNNs is vanishing gradients. Unrolled RNNs can be very deep, and may have a hard time propagating the gradients all the way back to the first layer of the network during backpropagation. When this happens, the network can't learn relationships between distant tokens. One way to avoid this problem is to introduce explicit state management by using gates. The two most common architectures that introduce gates are **LSTM** and the **Gated Recurrent Unit** (GRU).
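
As a rough sketch of what "explicit state management by using gates" means, the following code implements one LSTM step for a single unit with scalar input, using the standard LSTM equations. The parameter dictionary layout and the names are assumptions made for this example; a real layer uses weight matrices instead of scalars.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h, c, p):
    """One LSTM step with scalar input x, hidden state h and cell state c."""
    f = sigmoid(p['wf'] * x + p['uf'] * h + p['bf'])    # forget gate
    i = sigmoid(p['wi'] * x + p['ui'] * h + p['bi'])    # input gate
    o = sigmoid(p['wo'] * x + p['uo'] * h + p['bo'])    # output gate
    g = math.tanh(p['wg'] * x + p['ug'] * h + p['bg'])  # candidate value
    c_new = f * c + i * g            # keep part of the old state, add new content
    h_new = o * math.tanh(c_new)     # expose part of the cell state
    return h_new, c_new


zero_params = {k: 0.0 for k in
               ['wf', 'uf', 'bf', 'wi', 'ui', 'bi',
                'wo', 'uo', 'bo', 'wg', 'ug', 'bg']}

# With a large forget bias and a very negative input bias,
# the cell state passes through the step unchanged.
remember_params = dict(zero_params, bf=100.0, bi=-100.0)
h1, c1 = lstm_step(x=1.0, h=0.0, c=2.0, p=remember_params)
print(c1)
```

Because the cell state can be carried forward almost unchanged when the forget gate is open, information and gradients can survive over many more steps than in a simple RNN.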

In [None]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, embed_size),
    keras.layers.LSTM(8),
    keras.layers.Dense(4, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(8),
             validation_data=ds_test.map(tupelize).batch(8))

### Bidirectional and multilayer RNNs 

In our examples so far, the recurrent networks operate from the beginning of a sequence until the end. For scenarios that require random access to the input sequence, it makes more sense to run the recurrent computation in both directions. RNNs that allow computations in both directions are called bidirectional RNNs, and they can be created by wrapping the recurrent layer with a special `Bidirectional` layer.
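
The idea can be sketched with a toy recurrence in which the order of the inputs matters: we run it once over the sequence, once over the reversed sequence, and concatenate the two final states (concatenation is the default merge mode of the Keras `Bidirectional` wrapper). The recurrence `h = decay * h + x` and the function names are stand-ins chosen for this example, not an actual RNN cell.

```python
def run_recurrence(inputs, decay=0.5):
    """Toy order-sensitive recurrence h = decay * h + x; returns the final state."""
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h


def bidirectional(inputs):
    """Run the recurrence in both directions and concatenate the final states."""
    forward = run_recurrence(inputs)
    backward = run_recurrence(list(reversed(inputs)))
    return [forward, backward]


states = bidirectional([1.0, 2.0, 3.0])
print(states)
```

The forward state is dominated by the end of the sequence and the backward state by its beginning, so together they describe both ends.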

> The Bidirectional layer makes two copies of the layer within it, and sets the go_backwards property of one of those copies to True, making it go in the opposite direction along the sequence

Recurrent networks, unidirectional or bidirectional, capture patterns within a sequence, and store them into state vectors or return them as output. As with convolutional networks, we can build another recurrent layer following the first one to capture higher level patterns, built from lower level patterns extracted by the first layer. This leads us to the notion of a **multi-layer RNN**, which consists of two or more recurrent networks, where the output of the previous layer is passed to the next layer as input.

Keras makes constructing these networks an easy task, because you just need to add more recurrent layers to the model. For all layers except the last one, we need to specify the `return_sequences=True` parameter, because we need the layer to return all intermediate states, and not just the final state of the recurrent computation.

Let's build a two-layer bidirectional LSTM for our classification problem.

In [None]:
model = keras.models.Sequential([
    vectorizer,
    keras.layers.Embedding(vocab_size, 128, mask_zero=True),
    keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(4, activation='softmax')
])

model.compile(loss='sparse_categorical_crossentropy', metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(batch_size),
                validation_data=ds_test.map(tupelize).batch(batch_size))

# Generate texts with recurrent networks

In [2]:
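dataset = tfds.load('ag_news_subset')
# Select the splits by name rather than relying on the order of the dict values
ds_train, ds_test = dataset['train'], dataset['test']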

## Building character vocabulary 

To build a character-level generative network, we need to split the text into individual characters instead of words.

In [5]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True,lower=False)
tokenizer.fit_on_texts([x['title'].numpy().decode('utf-8') for x in ds_train])

In [6]:
# Denote end-of-sequence with the token <eos>
eos_token = len(tokenizer.word_index)+1
tokenizer.word_index['<eos>'] = eos_token

vocab_size = eos_token + 1

In [7]:
tokenizer.texts_to_sequences(['Hello, world!'])

[[48, 2, 10, 10, 5, 44, 1, 25, 5, 8, 10, 13, 78]]

## Training a generative RNN to generate titles

To generate titles using an RNN, we take one title as input and, for each input character in that title, train the network to generate the next character as output.
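
In other words, the target sequence is the input sequence shifted by one position, with an end-of-sequence token appended. A minimal sketch, using a hypothetical three-character vocabulary and a hypothetical end-of-sequence id:

```python
def next_char_pairs(text, char_to_id, eos_id):
    """Return (inputs, targets), where targets are the inputs shifted left by one."""
    ids = [char_to_id[ch] for ch in text]
    targets = ids[1:] + [eos_id]
    return ids, targets


toy_char_to_id = {'c': 1, 'a': 2, 't': 3}
toy_eos_id = 4

inputs, targets = next_char_pairs('cat', toy_char_to_id, toy_eos_id)
print(inputs)
print(targets)
```

At every position the network sees the characters so far and is trained to predict the entry of `targets` at the same index.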

In [11]:
def title_batch(x):
    # Extract the actual text from the string tensor
    x = [t.numpy().decode('utf-8') for t in x]
    # Convert the list of strings into a list of integer tensors
    z = tokenizer.texts_to_sequences(x)
    # Pad those tensors to their maximum length
    z = tf.keras.preprocessing.sequence.pad_sequences(z)
    return tf.one_hot(z, vocab_size), tf.one_hot(tf.concat([z[:,1:], tf.constant(eos_token, shape=(len(z),1))], axis=1), vocab_size)

In [12]:
def title_batch_fn(x):
    x = x['title']
    a, b = tf.py_function(title_batch, inp=[x], Tout=(tf.float32, tf.float32))
    return a, b

In [13]:
model = keras.models.Sequential([
    keras.layers.Masking(input_shape=(None, vocab_size)),
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.Dense(vocab_size, activation='softmax')
])

model.summary()
model.compile(loss='categorical_crossentropy')

model.fit(ds_train.batch(8).map(title_batch_fn))

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking_1 (Masking)          (None, None, 84)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, None, 128)         109056    
_________________________________________________________________
dense_1 (Dense)              (None, None, 84)          10836     
Total params: 119,892
Trainable params: 119,892
Non-trainable params: 0
_________________________________________________________________


<keras.callbacks.History at 0x113b9ec59d0>

## Generating output

We need to decode text represented by a sequence of token numbers. We could use the `tokenizer.sequences_to_texts` function, but it does not work well with character-level tokenization. Therefore, we will take the dictionary of tokens from the tokenizer (called `word_index`), build a reverse map, and write our own decoding function.

In [15]:
reverse_map = {val: key for key, val in tokenizer.word_index.items()}

def decode(x):
    return ''.join([reverse_map.get(t,'') for t in x])

Now, let's do the generation. We first encode the string passed as a parameter into a sequence, and then on each step we call our network to infer the next character.

The output of the network is a vector of `vocab_size` elements representing the probabilities of each token, and we can find the most probable token number by using `argmax`. We then append this character to the generated list of tokens, and proceed with the generation. This process of generating one character is repeated `size` times to generate the required number of characters, and we terminate early if `eos_token` is encountered.

In [16]:
def generate(model, size=100, start='Today '):
    inp = tokenizer.texts_to_sequences([start])[0]
    # Copy the list, so that appending to chars does not also modify inp
    chars = list(inp)
    for i in range(size):
        out = model(tf.expand_dims(tf.one_hot(inp, vocab_size), 0))[0][-1]
        # Convert the predicted token to a plain Python int
        nc = int(tf.argmax(out))
        if nc == eos_token:
            break
        chars.append(nc)
        inp = inp+[nc]
    return decode(chars)

generate(model)

'Today #39;streat to return to chief contron in the to return to set to set to set to &lt;b&gt;...&lt;/b&gt'

## Sampling output during training

The only way we can see that our model is getting better is by sampling generated strings during training. We use callbacks for that, which are functions that we can pass to the `fit` function and that will be called periodically during training.

In [17]:
sampling_callback = keras.callbacks.LambdaCallback(
    on_epoch_end = lambda batch, logs: print(generate(model))
)

model.fit(ds_train.batch(8).map(title_batch_fn), callbacks=[sampling_callback], epochs=3)

Epoch 1/3
Today #39;barght #39; for the strike to be of the consumer service
Epoch 2/3
Today #39;ease #39; control of the start to be search control
Epoch 3/3
Today #39;early #39; and start to be security #39;


<keras.callbacks.History at 0x113b9cb38b0>

## Soft text generation and temperature

In the `generate` function, we always took the character with the highest probability as the next character in the generated text. This resulted in text that cycles between the same character sequences again and again.

However, if we look at the probability distribution for the next character, there may be several high probabilities that are quite similar. For example, when looking for the next character in the sequence 'play', it is about equally likely to be a space or *e* (as in the word *player*).

Therefore, selecting the character with the absolute highest probability is not always the best choice: choosing the second or third most likely one might still lead to meaningful text, and may avoid cycling through character sequences. A better strategy is to sample characters from the probability distribution given by the network output.
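
How "soft" the sampling is can be controlled with a temperature parameter T: each probability is rescaled as p^(1/T) and the result is renormalized. The sketch below shows this rescaling on a hypothetical three-token distribution, using only the standard library; the function names are assumptions made for this example, and all probabilities are assumed to be strictly positive.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T) and renormalize them."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_probs = [0.5, 0.3, 0.2]
sharp = apply_temperature(toy_probs, 0.3)  # favors the most likely token
flat = apply_temperature(toy_probs, 1.8)   # closer to uniform

rng = random.Random(0)
token = sample(sharp, rng)

print(sharp)
print(flat)
```

A low temperature behaves almost like `argmax`, while a high temperature makes unlikely characters nearly as probable as likely ones.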

This sampling can be done using the `np.random.multinomial` function, which draws samples from a multinomial distribution. A function that implements this soft text generation is defined below.

In [19]:
def generate_soft(model,size=100,start='Today ',temperature=1.0): 
    '''
        The temperature parameter controls how strongly we favor higher-probability characters over
        lower-probability ones.
        When the temperature is close to 0, we almost always choose the highest-probability character; as it
        approaches infinity, all probabilities become equal and we select the next character uniformly at random.
    '''
    inp = tokenizer.texts_to_sequences([start])[0]
    # Copy the list, so that appending to chars does not also modify inp
    chars = list(inp)
    for i in range(size):
        out = model(tf.expand_dims(tf.one_hot(inp,vocab_size),0))[0][-1]
        probs = tf.exp(tf.math.log(out)/temperature).numpy().astype(np.float64)
        probs = probs/np.sum(probs)
        nc = np.argmax(np.random.multinomial(1,probs,1))
        if nc==eos_token:
            break
        chars.append(nc)
        inp = inp+[nc]
    return decode(chars)

words = ['Today ','On Sunday ','Moscow, ','President ','Little red riding hood ']
    
for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"\n--- Temperature = {i}")
    for j in range(5):
        print(generate_soft(model,size=300,start=words[j],temperature=i))


--- Temperature = 0.3
Today PC Partners to State Start Cardinals (AP)
On Sunday #3;;Well Start of Windows Madrid Street (AP)
Moscow, S2 Wanted #39; Sale Against After With Market (AP)
President Symon Stratege Series for Take (AP)
Little red riding hood t candidate to be for its a security at China

--- Temperature = 0.8
Today RC Well Step Not Nekie #39; Coach Pensions Out Offensive (AP)
On Sunday #39;athens in holiday for N #39;s castats #39;
Moscow, RUS to retract shows death confidents in China classive &lt;b&gt;...&lt;/b&gt;
President Arman Net Schrouble #39; Bush Like Insilitics
Little red riding hood sin to cut at Phote fake debt change &lt;b&gt;...&lt;/b&gt;

--- Temperature = 1.0
Today onst overtown in't Sen Depution repeas - end scauds to Venney cleries (AFP)
On Sunday pacelitiar remerits (UCliac)
Moscow, HJPUNAS Hand Fieldreas Munch Web Iraq Summary
President wreken #39;s latest on &lt;b&gt;...&lt;/b&gt;
Little red riding hood on Preparsaving Fording firel condo

--- Temperat