In [1]:
%config IPCompleter.greedy = True
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
%load_ext tensorboard

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
import tensorflow as tf
import os
from datetime import datetime

pd.set_option('mode.chained_assignment', None)
sn.set(rc={'figure.figsize':(9,9)})
sn.set(font_scale=1.4)

# make results reproducible
seed = 0
np.random.seed(seed)

!pip install -q tensorflow-text

# Natural Language Processing

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Historically, many NLP tasks were tackled with hand-written, rule-based systems. Modern approaches based on machine learning, and deep neural networks in particular, now outperform them on most tasks. These approaches learn the rules of natural language by analysing large corpora (collections of documents, possibly with human or computer annotations) of typical real-world examples. Deep neural network architectures achieve state-of-the-art results in many natural language tasks, such as language modeling and parsing. Popular techniques include word embeddings, which capture semantic properties of words, and end-to-end learning of a higher-level task (e.g., question answering).

# Word Embeddings

The models we have built so far process numerical data, but language consists of words, so we need a way to convert or encode words (or sets of words, i.e. sentences or documents) into a representation that we can feed into a model. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge-base methods, and explicit representation in terms of the context in which words appear.

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost the performance in NLP tasks such as syntactic parsing and sentiment analysis.

---

A naive way to create a word embedding is to take the vocabulary seen in the problem domain and encode each word as a "one-hot" vector across that vocabulary. We could then concatenate the one-hot vectors for each word in our input. In practice this does not work well: there are an estimated 13 million words in the English language, so the embedding space is very large, the vectors are almost entirely zeros, and they provide no notion of similarity between words. Instead we want to find a smaller subspace that encodes the relationships between words.
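To make the limitation concrete, here is a minimal NumPy sketch of one-hot encoding. The five-word vocabulary and the helper `one_hot` are made up for illustration and are not part of any dataset used in this notebook. Every pair of distinct words has a dot product of zero, so the encoding says nothing about which words are related.

```python
import numpy as np

# Hypothetical toy vocabulary, purely for illustration.
vocab = ["cat", "dog", "frog", "king", "queen"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_index[word]] = 1.0
    return vec

cat, dog = one_hot("cat"), one_hot("dog")

# Distinct words are always orthogonal: "cat" is no closer to "dog" than to "king".
similarity_cat_dog = float(cat @ dog)
similarity_cat_cat = float(cat @ cat)
print(similarity_cat_dog, similarity_cat_cat)
```

The vector length grows with the vocabulary, which is why this representation becomes impractical for real vocabularies.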

Earlier approaches built a word co-occurrence matrix and applied dimensionality reduction (Singular Value Decomposition) to it, producing a dense vector for each word.
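The co-occurrence idea can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not a reproduction of any published method: the three-sentence corpus, the window size of 1 and the choice of `k = 2` dimensions are all made up for the example.

```python
import numpy as np

# Hypothetical toy corpus, purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
window = 1  # count neighbours one position to the left and right

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
index = {word: i for i, word in enumerate(vocab)}

# Symmetric word-by-word co-occurrence counts.
cooc = np.zeros((len(vocab), len(vocab)), dtype=np.float64)
for sentence in tokens:
    for pos, word in enumerate(sentence):
        lo, hi = max(0, pos - window), min(len(sentence), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                cooc[index[word], index[sentence[ctx]]] += 1

# Truncated SVD: keep the top-k singular directions as dense word vectors.
k = 2
u, s, _ = np.linalg.svd(cooc)
embeddings = u[:, :k] * s[:k]
print(embeddings.shape)
```

Each row of `embeddings` is a dense `k`-dimensional vector for one vocabulary word. On a corpus this small the vectors are not meaningful; the point is only the shape of the computation.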

One key insight was [that](https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis) "a word is characterized by the company it keeps", which led to approaches that take into account the words surrounding the word whose embedding we are trying to learn. Modern approaches use neural network models, or more recently simple [log-bilinear models](https://nlp.stanford.edu/pubs/glove.pdf), to map words to a dense vector (word embedding). These models are trained to learn a representation that encodes each word in its context (i.e. the set of words that normally appear near it in text samples). They are often pre-trained on large corpora; a selection of these models is:
* **Word2vec**: [2013](https://arxiv.org/abs/1301.3781), a shallow two-layer neural network. Comes with two training objectives: continuous bag-of-words (CBOW), best used for small datasets, and Skip-Gram, best for large datasets. Trained on 30 billion words of the English Google News corpus.
* **GloVe**: [2014](https://nlp.stanford.edu/projects/glove/) (Global Vectors), essentially a log-bilinear model with a weighted least-squares objective. Several variants are available, ranging from a model trained on 6 billion words (Wikipedia 2014 + Gigaword 5) to a model trained on an 840 billion word web crawl.
* **FastText**: [2016](https://arxiv.org/abs/1607.04606), based on the skip-gram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and a word is represented as the sum of these representations. Several variants are available, ranging from a model trained on 16 billion words (Wikipedia 2017 + UMBC + news dataset) to a model trained on a 600 billion word web crawl.

It is common to see word embeddings that are 8-dimensional (for small datasets), up to 1024-dimensional when working with large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but takes more data to learn. As a rule of thumb, use a pre-trained model with the highest dimension available if the parameters are not being fine-tuned, and consider a smaller dimension if they are being fine-tuned to our dataset. Another way to think of an embedding is as a "lookup table": once the weights have been learned, we can encode each word by looking up the dense vector it corresponds to in the table. To create sentence or document embeddings from a word embedding model, a common approach is to take the mean of the individual word embeddings, which gives a single embedding feature for our model.
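The mean-pooling step can be sketched as follows. The three-dimensional vectors in `embedding_table` and the helper `sentence_embedding` are hypothetical, chosen only to keep the arithmetic easy to follow; a real table would come from a trained model.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, made up for illustration.
embedding_table = {
    "the": np.array([0.1, 0.0, 0.2]),
    "dog": np.array([0.9, 0.3, 0.1]),
    "barked": np.array([0.5, 0.6, 0.3]),
}

def sentence_embedding(sentence):
    """Average the vectors of the known words in a whitespace-split sentence."""
    vectors = [embedding_table[w] for w in sentence.lower().split()
               if w in embedding_table]
    if not vectors:
        raise ValueError("no known words in sentence")
    return np.mean(vectors, axis=0)

sentence_vec = sentence_embedding("The dog barked")
print(sentence_vec)
```

Averaging discards word order, so "dog bites man" and "man bites dog" receive the same vector. That is one reason sentence-level models are attractive.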

As text embeddings depend on the distributional hypothesis (the surrounding words provide context), more recent approaches use **sentence embeddings** instead, such as:
* **ELMo**: [2018](https://arxiv.org/abs/1802.05365) (Embeddings from Language Models) uses character-based word representations and bidirectional LSTMs. Trained on 30 million sentences.
* **NNLM** (Recommended to use in practice) : [2018](https://tfhub.dev/google/nnlm-en-dim128/2) (Neural-Net Language Models) based on NNLM with three hidden layers. Token based text embedding trained on English Google News 200 billion corpus.

## Keras Embedding layer

Keras provides the `tf.keras.layers.Embedding` layer, which makes it easy to use word embeddings. The layer can be understood as a lookup table that maps integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter we can experiment with to see what works well for our problem, much as we would experiment with the number of neurons in a Dense layer; for pre-trained embedding layers it is usually fixed. When we create an Embedding layer ourselves, its weights are randomly initialized (just like any other layer) and are gradually adjusted via backpropagation during training; alternatively we can start from a pre-trained model. Once trained, the learned word embeddings roughly encode similarities between words (as learned for the specific problem the model was trained on).

[[1](https://www.tensorflow.org/tutorials/text/word_embeddings)]

If we pass integers to the embedding layer, each integer is replaced by the corresponding vector from the embedding table:


In [2]:
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow_datasets as tfds

embedding_layer = layers.Embedding(1000, 5)
result = embedding_layer(tf.constant([1,2,3]))
result.numpy()

array([[ 0.01316785,  0.02323476, -0.01350605, -0.0314668 ,  0.04560835],
       [ 0.04549145,  0.02358145,  0.00343136, -0.0443067 , -0.04387503],
       [-0.00826237, -0.0350232 ,  0.02040035,  0.03441587,  0.01151514]],
      dtype=float32)

More formally the `tf.keras.layers.Embedding` layer takes the following arguments:
* `input_dim`: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
* `output_dim`: int >= 0. Dimension of the dense embedding.

And optional keyword arguments:
* `embeddings_initializer`: Initializer for the embeddings matrix.
* `embeddings_regularizer`: Regularizer function applied to the embeddings matrix.
* `embeddings_constraint`: Constraint function applied to the embeddings matrix.
* `mask_zero`: Whether or not the input value 0 is a special "padding" value that should be masked out.
* `input_length`: Length of input sequences, when it is constant.



For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths. We could feed into the embedding layer above batches with shapes `(32, 10)` (batch of 32 sequences of length 10) or `(64, 15)` (batch of 64 sequences of length 15).

The returned tensor has one more axis than the input; the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`, where `N` is `output_dim`.

In [3]:
result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))
result.shape

TensorShape([2, 3, 5])
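The extra axis is easier to see if we mimic the lookup with plain NumPy indexing. This is a conceptual sketch only, not how Keras implements the layer; the name `table` and the uniform initialisation range are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 1000, 5
# Randomly initialised table: one row per integer index.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

batch = np.array([[0, 1, 2], [3, 4, 5]])  # shape (2, 3)
embedded = table[batch]                   # shape (2, 3, 5)

print(embedded.shape)
```

Each integer in `batch` selects one row of `table`, so the result keeps the batch shape and appends the embedding dimension as the last axis.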

In practice, we often use general pre-trained word embeddings for NLP problems, because training a good word embedding requires a very large amount of text. We can load one of these from TensorFlow Hub.

In [4]:
import tensorflow_hub as hub

embedding_layer = hub.KerasLayer("https://tfhub.dev/google/nnlm-en-dim128/2",
                           input_shape=[], dtype=tf.string, trainable=False)

In [5]:
embedding_layer(["dog"])

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[-6.09472729e-02,  1.32798254e-02, -3.42817493e-02,
         4.26401682e-02, -1.43975914e-01,  1.62000328e-01,
        -2.93821730e-02, -2.00032350e-02, -7.07213208e-02,
         1.63512006e-02,  9.08970013e-02, -4.92882654e-02,
         3.67779247e-02, -3.68761532e-02,  1.69035912e-01,
        -1.26325879e-02, -1.02184914e-01, -4.70518619e-02,
        -2.19973363e-02, -8.74752700e-02, -5.23869321e-02,
         3.40459906e-02,  3.22821885e-02, -3.63609865e-02,
         1.39977902e-01,  3.29261534e-02,  3.83867393e-03,
        -5.63085563e-02, -5.25179058e-02, -1.48635376e-02,
         6.00850163e-03, -1.24928812e-02,  2.53655901e-03,
         1.05012886e-01, -7.33386502e-02, -1.16906567e-02,
        -5.32731973e-02, -8.08588266e-02,  1.42390028e-01,
         1.20618619e-01,  6.00544587e-02,  8.53905752e-02,
         2.86061447e-02, -6.75920993e-02, -1.22930340e-01,
        -1.01479821e-01, -9.18542147e-02,  1.24494404e-01,
      

As this is a sentence embedding layer, we can also turn whole sentences into embeddings:

In [6]:
embedding_layer(["The dog played with the cat"])

<tf.Tensor: shape=(1, 128), dtype=float32, numpy=
array([[ 2.82348961e-01,  5.27088307e-02, -5.94306886e-02,
         1.92900240e-01, -1.83743060e-01,  1.46496981e-01,
         9.77013707e-02, -1.07290536e-01, -2.57747229e-02,
         4.41485010e-02,  1.70615181e-01, -1.00377649e-01,
         9.42871347e-03,  3.02134082e-02,  1.34187490e-01,
         5.60502484e-02, -1.07281059e-01, -3.25787924e-02,
        -8.48783329e-02,  4.67057377e-01, -6.27068579e-02,
         1.32775724e-01,  4.04574051e-02,  1.63751058e-02,
         1.32634431e-01,  2.57065725e-02, -7.29926378e-02,
         7.19335005e-02,  1.11830607e-01, -2.11250614e-02,
         3.62873152e-02,  4.42225672e-02, -4.65986840e-02,
         5.73276244e-02, -7.11966679e-02, -5.71498871e-02,
        -1.18366979e-01, -1.10630527e-01, -3.84329781e-02,
        -4.61224578e-02,  6.81087002e-02,  1.19720802e-01,
        -1.97505840e-04, -2.34259702e-02, -2.73194835e-02,
        -1.29417166e-01,  4.85802516e-02,  1.24888368e-01,
      

### How to understand and use Embeddings

We can treat these text embeddings as vectors, where the Euclidean distance (or cosine similarity) between two vectors provides an effective measure of the linguistic or semantic similarity of the corresponding texts. Nearest neighbours of a sample text often reveal similar texts. For example, the nearest neighbours of the word embedding for frog commonly include frogs, toad, litoria and lizard.

The embeddings also have linear substructures, which make vector arithmetic meaningful. Let $E(w)$ be the embedding for word $w$; then we will see $E(\text{woman}) - E(\text{man}) + E(\text{king}) \approx E(\text{queen})$.
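Both ideas, similarity search and analogy arithmetic, can be sketched with hand-made vectors. The three-dimensional values in `E` below are invented so that the analogy works out exactly; real embeddings are learned and only satisfy it approximately. The helpers `cosine_similarity` and `nearest` are illustrative names.

```python
import numpy as np

# Hand-made 3-dimensional vectors (axes loosely: royal, male, female).
E = {
    "king":   np.array([1.0, 1.0, 0.0]),
    "queen":  np.array([1.0, 0.0, 1.0]),
    "man":    np.array([0.0, 1.0, 0.0]),
    "woman":  np.array([0.0, 0.0, 1.0]),
    "prince": np.array([0.9, 1.0, 0.0]),
    "apple":  np.array([0.1, 0.1, 0.1]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is most similar to vector."""
    candidates = [w for w in E if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vector, E[w]))

# woman - man + king should land closest to queen.
target = E["woman"] - E["man"] + E["king"]
answer = nearest(target, exclude=("woman", "man", "king"))
print(answer)
```

Excluding the query words from the candidates is the usual convention, since the nearest vector is otherwise often one of the inputs.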

# Working with text data

## Tokenization

Tokenization is the process of breaking up a string into tokens. Commonly, these tokens are words, numbers, or punctuation.

The main interfaces are `Tokenizer` and `TokenizerWithOffsets`, which provide the methods `tokenize` and `tokenize_with_offsets` respectively. Several tokenizers are available. Each of them implements `TokenizerWithOffsets` (which extends `Tokenizer`), which includes an option for getting byte offsets into the original string. This lets the caller know which bytes of the original string each token was created from.

All of the tokenizers return *RaggedTensors* (TensorFlow equivalent of nested variable-length lists) with the inner-most dimension of tokens mapping to the original individual strings.

### WhitespaceTokenizer
This is a basic tokenizer that splits `UTF-8` strings on whitespace characters as defined by International Components for Unicode (ICU), e.g. space, tab and new line.

Most TensorFlow operations expect strings to be in `UTF-8` format.

In [7]:
import tensorflow_text as text
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['The cat jumped up and suprised the dog!'])
print(tokens.to_list())

Instructions for updating:
`tf.batch_gather` is deprecated, please use `tf.gather` with `batch_dims=-1` instead.



[[b'The', b'cat', b'jumped', b'up', b'and', b'suprised', b'the', b'dog!']]


### UnicodeScriptTokenizer
This tokenizer splits UTF-8 strings based on Unicode ICU script boundaries.

In practice, this is similar to the `WhitespaceTokenizer`. The most apparent difference is that it splits punctuation (USCRIPT_COMMON) from language text (e.g. USCRIPT_LATIN, USCRIPT_CYRILLIC), while also separating texts in different scripts from each other.

In [8]:
tokenizer = text.UnicodeScriptTokenizer()
tokens = tokenizer.tokenize(['The cat jumped up and suprised the dog!'])
print(tokens.to_list())

[[b'The', b'cat', b'jumped', b'up', b'and', b'suprised', b'the', b'dog', b'!']]


### Unicode split

When tokenizing languages that do not use whitespace to segment words, it is common to split by character, which can be done with the `tf.strings.unicode_split` operation found in core TensorFlow.

In [9]:
tokens = tf.strings.unicode_split([u"仅今年前".encode('UTF-8')], 'UTF-8')
print(tokens.to_list())

[[b'\xe4\xbb\x85', b'\xe4\xbb\x8a', b'\xe5\xb9\xb4', b'\xe5\x89\x8d']]
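For comparison, the same character-level split can be sketched in plain Python, where iterating over a `str` yields one Unicode code point at a time. The variable names are illustrative. Each of these four characters occupies three bytes in UTF-8, which is why the output above shows three-byte tokens.

```python
text_in = "仅今年前"

# Iterating over a Python str yields one Unicode code point at a time.
characters = list(text_in)

# Encode each character separately to see its UTF-8 bytes.
encoded = [ch.encode("utf-8") for ch in characters]
print(encoded)
```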


### Offsets

When tokenizing strings, it is often useful to know where in the original string each token came from. For this reason, each tokenizer that implements `TokenizerWithOffsets` has a `tokenize_with_offsets` method that returns the byte offsets along with the tokens. `offset_starts` lists the byte in the original string at which each token starts, and `offset_limits` lists the byte at which each token ends (exclusive).

In [10]:
tokenizer = text.UnicodeScriptTokenizer()
(tokens, offset_starts, offset_limits) = tokenizer.tokenize_with_offsets(
    ['The cat jumped up and suprised the dog!'])
print(tokens.to_list())
print(offset_starts.to_list())
print(offset_limits.to_list())

[[b'The', b'cat', b'jumped', b'up', b'and', b'suprised', b'the', b'dog', b'!']]
[[0, 4, 8, 15, 18, 22, 31, 35, 38]]
[[3, 7, 14, 17, 21, 30, 34, 38, 39]]
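The meaning of the start and limit offsets can be checked with a small standard-library sketch. It uses a simple whitespace rule rather than the script-boundary rule of `UnicodeScriptTokenizer`, so `dog!` stays one token; the function name `whitespace_tokenize_with_offsets` is hypothetical, and the regular expression only treats ASCII whitespace as a separator.

```python
import re

def whitespace_tokenize_with_offsets(text_in):
    """Split on whitespace; return tokens plus byte start/limit offsets."""
    data = text_in.encode("utf-8")
    tokens, starts, limits = [], [], []
    for match in re.finditer(rb"\S+", data):
        tokens.append(match.group())
        starts.append(match.start())
        limits.append(match.end())
    return tokens, starts, limits

sentence = "The cat jumped up and suprised the dog!"
tokens, starts, limits = whitespace_tokenize_with_offsets(sentence)

# Slicing the original bytes with each (start, limit) pair recovers the token.
data = sentence.encode("utf-8")
recovered = [data[s:e] for s, e in zip(starts, limits)]
print(tokens, starts, limits)
```

Because the limit is exclusive, `data[start:limit]` is exactly the token, which is the property that makes offsets useful for highlighting or span labelling.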


### Working with TensorFlow Datasets

Tokenizers also work with `tf.data.Dataset` objects, for example by calling `tokenize` inside `Dataset.map`:

In [11]:
docs = tf.data.Dataset.from_tensor_slices(
    [['The dog was pretty chilled'], ["He was a cool cat!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
iterator = iter(tokenized_docs)
print(next(iterator).to_list())
print(next(iterator).to_list())

[[b'The', b'dog', b'was', b'pretty', b'chilled']]
[[b'He', b'was', b'a', b'cool', b'cat!']]


### N-grams & Sliding Window

N-grams are sequences of $n$ consecutive tokens, produced by sliding a window of size $n$ over the text. When combining the tokens, three reduction mechanisms are supported. For text, you would want to use `Reduction.STRING_JOIN`, which appends the strings to each other. The default separator character is a space, but this can be changed with the `string_separator` argument.

The other two reduction methods are most often used with numerical values, and these are `Reduction.SUM` and `Reduction.MEAN`.

In [12]:
tokenizer = text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(['Dog befriended the cat'])

# N-grams: uni-gram (n = 1), bi-gram (n = 2), tri-gram (n = 3)
unigrams = text.ngrams(tokens, 1, reduction_type=text.Reduction.STRING_JOIN)
print('Uni-gram: ', unigrams.to_list())
bigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)
print('Bi-gram: ', bigrams.to_list())
trigrams = text.ngrams(tokens, 3, reduction_type=text.Reduction.STRING_JOIN)
print('Tri-gram: ', trigrams.to_list())

Uni-gram:  [[b'Dog', b'befriended', b'the', b'cat']]
Bi-gram:  [[b'Dog befriended', b'befriended the', b'the cat']]
Tri-gram:  [[b'Dog befriended the', b'befriended the cat']]
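The sliding-window mechanism itself is simple enough to sketch in plain Python. The helper `ngrams` below is a hypothetical stand-in, not the `text.ngrams` API: it takes the reduction as an ordinary function, so string joining, summing and averaging are all the same operation applied to each window.

```python
def ngrams(tokens, n, reduce=" ".join):
    """Slide a window of width n over tokens and reduce each window."""
    return [reduce(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["Dog", "befriended", "the", "cat"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

# Numeric reductions, analogous in spirit to SUM and MEAN.
values = [1.0, 2.0, 3.0, 4.0]
window_sums = ngrams(values, 2, reduce=sum)
window_means = ngrams(values, 2, reduce=lambda w: sum(w) / len(w))
print(bigrams, trigrams, window_sums, window_means)
```

A sequence of length $m$ yields $m - n + 1$ windows, so four tokens give three bi-grams and two tri-grams, matching the output above.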


# Complete Text Classification Example

In this example we'll use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text (i.e. classify the text). This is a fully worked example in which we load the dataset from text files, process it, build our own word embedding and train a simple model to classify the text.

## Loading text data in TensorFlow

Let us look at an example of how to use `tf.data.TextLineDataset` to load examples from text files. `TextLineDataset` is designed to create a dataset from a text file, in which each example is a line of text from the original file. This is potentially useful for any text data that is primarily line-based (for example, poetry or error logs).

In [13]:
# Download the files locally
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)

parent_dir = os.path.dirname(text_dir)

## Load the text into TensorFlow datasets

Iterate through the files, loading each one into its own dataset.

Each example needs to be individually labeled, so use `tf.data.Dataset.map` to apply a `labeler` function to each one. This will iterate over every example in the dataset, returning `(example, label)` pairs.

In [14]:
def labeler(example, index):
    return example, tf.cast(index, tf.int64)


labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(
        os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

Combine these labeled datasets into a single dataset, and shuffle it.


In [15]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)

We can use `tf.data.Dataset.take` and `print` to see what the `(example, label)` pairs look like. Printing a tensor shows its value; the `numpy()` method returns that value directly.

In [16]:
for example, label in all_labeled_data.take(5):
    print('Label {} : '.format(label), example)

Label 1 :  tf.Tensor(b'extremity of the danger. Agamemnon proposes to make their escape by', shape=(), dtype=string)
Label 1 :  tf.Tensor(b"Are driv'n; among them Ajax spreads dismay,", shape=(), dtype=string)
Label 0 :  tf.Tensor(b'To gather it from all the Greeks again.', shape=(), dtype=string)
Label 0 :  tf.Tensor(b'Nought trusted he, but with an iron mace', shape=(), dtype=string)
Label 0 :  tf.Tensor(b'By Priameian Hector, fierce as Mars,', shape=(), dtype=string)


### Build a vocabulary

First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are a few ways to do this in both TensorFlow and Python. Here we will:

1. Iterate over each example's `numpy` value.
2. Use `tfds.features.text.Tokenizer` to split it into tokens (in newer TensorFlow Datasets releases this module is exposed as `tfds.deprecated.text`).
3. Collect these tokens into a Python set, to remove duplicates.
4. Get the size of the vocabulary for later use.

In [17]:
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
print('Vocab size: ', vocab_size)

Vocab size:  17178


### Encode the examples

Create an encoder by passing the `vocabulary_set` to `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes in a string of text and returns a list of integers.

In [18]:
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

For example, the encoder maps a raw line of text to a list of integer token indices:

In [19]:
example_text = next(iter(all_labeled_data))[0].numpy()
encoded_example = encoder.encode(example_text)
print('Raw Example : ', example_text)
print('Encoded Example: ', encoded_example)

Raw Example :  b'extremity of the danger. Agamemnon proposes to make their escape by'
Encoded Example:  [5057, 10197, 6221, 164, 2273, 841, 9758, 4349, 15865, 11355, 10608]


Note that we can also decode the encoded example back into text (punctuation is dropped by the tokenizer):

In [20]:
print('Decoded Example: ', encoder.decode(encoded_example))

Decoded Example:  extremity of the danger Agamemnon proposes to make their escape by
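What the encoder does can be sketched with a dictionary. This is a simplified stand-in, not the `TokenTextEncoder` implementation: the tokenization rule, the two sample lines and the names `token_to_id`, `encode` and `decode` are assumptions for illustration. Index 0 is left unused so that it can serve as the padding value later on.

```python
import re

def tokenize(line):
    """Split on runs of non-alphanumeric characters (an assumed, simplified rule)."""
    return [tok for tok in re.split(r"\W+", line) if tok]

lines = ["To gather it from all the Greeks again.",
         "Nought trusted he, but with an iron mace"]

vocabulary = sorted({tok for line in lines for tok in tokenize(line)})
# Index 0 is reserved for padding, so real tokens start at 1.
token_to_id = {tok: i for i, tok in enumerate(vocabulary, start=1)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

def encode(line):
    """Map each token of the line to its integer index."""
    return [token_to_id[tok] for tok in tokenize(line)]

def decode(ids):
    """Map integer indices back to tokens, joined by spaces."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode(lines[0])
round_trip = decode(ids)
print(ids, round_trip)
```

As in the decoded example above, the round trip loses punctuation because the tokenizer discards it before the lookup.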


Now we can run the encoder on the dataset by wrapping it in `tf.py_function` and  passing that to the dataset's `map` method.

In [21]:
def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

We want to use `Dataset.map` to apply this function to each element of the dataset.  `Dataset.map` runs in graph mode.

* Graph tensors do not have a value. 
* In graph mode we can only use TensorFlow Ops and functions. 

So we can't `.map` this function directly; we need to wrap it in a `tf.py_function`. The `tf.py_function` passes regular (eager) tensors, which have a value and a `.numpy()` method to access it, to the wrapped Python function.

In [22]:
def encode_map_fn(text, label):
    # py_func doesn't set the shape of the returned tensors.
    encoded_text, label = tf.py_function(encode,
                                         inp=[text, label],
                                         Tout=(tf.int64, tf.int64))

    # `tf.data.Datasets` work best if all components have a shape set
    #  so set the shapes manually:
    encoded_text.set_shape([None])
    label.set_shape([])

    return encoded_text, label


all_encoded_data = all_labeled_data.map(encode_map_fn)

## Split the dataset into test and train batches

We can use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.

Before being passed into the model, the datasets need to be batched. Typically, the examples inside a batch need to be the same size and shape. However, the examples in these datasets are not all the same size, because each line of text has a different number of words. So we use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.

In [23]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([None],[]))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([None],[]))
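To see what padding does to a batch, here is a minimal NumPy sketch. The helper `pad_batch` and the three short integer sequences are hypothetical; the sketch pads on the right with zeros up to the longest sequence in the batch, which is the behaviour `padded_shapes=([None], [])` requests for the text component.

```python
import numpy as np

def pad_batch(sequences, pad_value=0):
    """Right-pad variable-length integer sequences to the longest in the batch."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    return batch

# Hypothetical encoded lines of different lengths.
encoded_lines = [[5, 12, 7], [3, 9, 4, 8, 2], [6]]
padded = pad_batch(encoded_lines)
print(padded)
```

Because the padded length is decided per batch, different batches can have different widths, which keeps the amount of padding small.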

Now, `test_data` and `train_data` are no longer collections of `(example, label)` pairs, but collections of batches. Each batch is a pair of (*many examples*, *many labels*) represented as arrays.

For example:

In [24]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

(<tf.Tensor: shape=(14,), dtype=int64, numpy=
 array([ 5057, 10197,  6221,   164,  2273,   841,  9758,  4349, 15865,
        11355, 10608,     0,     0,     0])>,
 <tf.Tensor: shape=(), dtype=int64, numpy=1>)

Since padding introduces a new token value (the zero used for padding), the vocabulary size increases by one.

In [25]:
vocab_size += 1

## Build the model


In [26]:
model = tf.keras.Sequential()

The first layer converts integer representations to dense vector embeddings. Here we use an embedding layer with randomly initialized weights, which are learned during training. (In practice we would often use a pre-trained word embedding as described above, either frozen or fine-tuned during training.)

In [27]:
model.add(tf.keras.layers.Embedding(vocab_size, 64))

The next layer is a Long Short-Term Memory (LSTM) layer, which lets the model understand words in their context with other words. A bidirectional wrapper on the LSTM helps it learn about each data point in relation to the data points that come before and after it. (We will go into more detail in an upcoming video on the LSTM.)

In [28]:
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

Finally we'll have a series of one or more densely connected layers, with the last one being the output layer. The output layer has no activation, so it produces one logit (unnormalized score) per label; applying a softmax turns these into probabilities. The label with the highest score is the model's prediction for an example.

In [29]:
# One or more dense layers.
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3))
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 64)          1099456   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 195       
Total params: 1,178,115
Trainable params: 1,178,115
Non-trainable params: 0
_________________________________________________________________
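The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below is plain arithmetic based on the layer sizes used above; the variable and function names are illustrative.

```python
vocab_size = 17178 + 1      # vocabulary plus the padding index
embedding_dim = 64
lstm_units = 64

# One embedding_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embedding_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias.
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_one_direction

def dense_params(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units

dense_1 = dense_params(2 * lstm_units, 64)  # forward and backward outputs are concatenated
dense_2 = dense_params(64, 64)
output = dense_params(64, 3)

total = embedding_params + bidirectional_params + dense_1 + dense_2 + output
print(total)
```

Almost all of the parameters sit in the embedding table, which is typical for text models with a large vocabulary and a small recurrent layer.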


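Finally, compile the model. For a multi-class classifier with integer labels, use sparse categorical cross-entropy as the loss function; because the output layer returns logits rather than softmax probabilities, we pass `from_logits=True`. We will use the popular `Adam` optimizer.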

In [30]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
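To make the role of `from_logits=True` concrete, here is a minimal NumPy sketch of the computation: a softmax over the logits followed by the negative log-probability of the true class. It is an illustration of the formula, not the Keras implementation, and the logits and labels are made-up values.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sparse_categorical_crossentropy(labels, logits):
    """Mean negative log-probability of the true class, given raw logits."""
    probs = softmax(logits)
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.log(true_class_probs).mean())

# Hypothetical logits for two examples and three translators.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

probs = softmax(logits)
predictions = probs.argmax(axis=-1)
loss = sparse_categorical_crossentropy(labels, logits)
print(probs, predictions, loss)
```

"Sparse" refers to the labels being integer class indices rather than one-hot vectors, which matches the `int64` labels produced by `labeler`.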

## Train the model

Trained for three epochs on this data, the model produces decent results: roughly 83% accuracy on the held-out test set.

In [31]:
history = model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [32]:
eval_loss, eval_acc = model.evaluate(test_data)

print('\nEvaluation loss: {:.3f}, \nEvaluation accuracy: {:.3f}'.format(eval_loss, eval_acc))

     79/Unknown - 2s 27ms/step - loss: 0.3997 - accuracy: 0.8344
Evaluation loss: 0.400, 
Evaluation accuracy: 0.834
