<br>

<div align=center><font color=maroon size=6><b>Load text</b></font></div>

<br>

<font size=4><b>References:</b></font>
1. TF2 official tutorials: <a href="https://www.tensorflow.org/tutorials" style="text-decoration:none;">TensorFlow Tutorials</a> 
    * TensorFlow > Learn > TensorFlow Core > Tutorials > <a href="https://www.tensorflow.org/tutorials/load_data/text" style="text-decoration:none;">Load text</a>
        * Run in <a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/text.ipynb" style="text-decoration:none;">Google Colab</a>

<br>
<br>
<br>

This tutorial demonstrates two ways to load and preprocess text.

- First, you will use Keras utilities and preprocessing layers. These include `tf.keras.utils.text_dataset_from_directory` to turn data into a `tf.data.Dataset` and `tf.keras.layers.TextVectorization` for data standardization, tokenization, and vectorization. If you are new to TensorFlow, you should start with these.


- Then, you will use lower-level utilities like `tf.data.TextLineDataset` to load text files, and [TensorFlow Text](https://www.tensorflow.org/text) APIs, such as `text.UnicodeScriptTokenizer` and `text.case_fold_utf8`, to preprocess the data for finer-grained control.

In [1]:
!pip install "tensorflow-text==2.8.*"

Collecting tensorflow-text==2.8.*
  Downloading tensorflow_text-2.8.2-cp39-cp39-win_amd64.whl (2.5 MB)
Collecting tensorflow<2.9,>=2.8.0
  Downloading tensorflow-2.8.0-cp39-cp39-win_amd64.whl (438.0 MB)
Collecting tf-estimator-nightly==2.8.0.dev2021122109
  Downloading tf_estimator_nightly-2.8.0.dev2021122109-py2.py3-none-any.whl (462 kB)
Collecting tensorflow-io-gcs-filesystem>=0.23.1
  Downloading tensorflow_io_gcs_filesystem-0.25.0-cp39-cp39-win_amd64.whl (1.5 MB)
Collecting libclang>=9.0.1
  Downloading libclang-14.0.1-py2.py3-none-win_amd64.whl (14.2 MB)
Collecting keras<2.9,>=2.8.0rc0
  Downloading keras-2.8.0-py2.py3-none-any.whl (1.4 MB)
Collecting tensorboard<2.9,>=2.8
  Downloading tensorboard-2.8.0-py3-none-any.whl (5.8 MB)
Installing collected packages: tf-estimator-nightly, tensorflow-io-gcs-filesystem, tensorboard, libclang, keras, tensorflow, tensorflow-text
  Attempting uninstall: tensorboard
    Found existing installation: tensorboard 2.5.0
    Uninstalling tensorboar

<br>

In [1]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import TextVectorization


import collections
import pathlib

In [2]:
import tensorflow_datasets as tfds
import tensorflow_text as tf_text



<br>

In [3]:
tf.__version__

'2.8.0'

In [4]:
tf.test.is_gpu_available()

Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.


True

<br>
<br>

## Example 1: Predict the tag for a Stack Overflow question

As a first example, you will download a dataset of programming questions from Stack Overflow. Each question (_"How do I sort a dictionary by value?"_) is labeled with exactly one tag (`Python`, `CSharp`, `JavaScript`, or `Java`). Your task is to develop a model that predicts the tag for a question. This is an example of multi-class classification—an important and widely applicable kind of machine learning problem.

<br>

### Download and explore the dataset

Begin by downloading the Stack Overflow dataset using `tf.keras.utils.get_file`, and exploring the directory structure:

In [5]:
#data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
#
#dataset_dir = utils.get_file(
#    origin=data_url,
#    untar=True,
#    cache_dir='stack_overflow',
#    cache_subdir='')
#
#dataset_dir = pathlib.Path(dataset_dir).parent
#
#dataset_dir
#
# Output: WindowsPath('/tmp/.keras')


In [6]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'

dataset_dir = utils.get_file(origin=data_url,
                             untar=True,
                             cache_dir='D:/KeepStudy/0_Coding/0_dataset',
                             cache_subdir='stack_overflow')

dataset_dir = pathlib.Path(dataset_dir).parent

In [7]:
dataset_dir

WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow')

In [8]:
list(dataset_dir.iterdir())

[WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/README.md'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/stack_overflow_16k.tar.gz'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/test'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/train')]

In [9]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/train/csharp'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/train/java'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/train/javascript'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/stack_overflow/train/python')]

<br>

The `train/csharp`, `train/java`, `train/python` and `train/javascript` directories contain many text files, each of which is a Stack Overflow question.

Print an example file and inspect the data:

In [10]:
sample_file = train_dir/'python/1755.txt'

with open(sample_file) as f:
    print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x



<br>

### Load the dataset

Next, you will load the data off disk and prepare it into a format suitable for training. To do so, you will use the `tf.keras.utils.text_dataset_from_directory` utility to create a labeled `tf.data.Dataset`. If you're new to `tf.data`, it's a powerful collection of tools for building input pipelines. (Learn more in the [tf.data: Build TensorFlow input pipelines](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/data.ipynb) guide.)

The `tf.keras.utils.text_dataset_from_directory` API expects a directory structure as follows:

```
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
```
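
As a rough illustration of how a class-per-folder layout turns into integer labels, the sketch below walks such a tree with the standard library only. It is not the Keras implementation: `infer_labels` is a hypothetical helper, and the file contents are made up. It assumes labels are assigned by sorting the subdirectory names alphanumerically, which is the documented behaviour for inferred labels.

```python
import pathlib
import tempfile


def infer_labels(root):
    """Return (class_names, [(path, label_index), ...]) for a class-per-folder tree."""
    root = pathlib.Path(root)
    # One class per subdirectory; sorting fixes the name -> index mapping.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for index, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            samples.append((path, index))
    return class_names, samples


# Build a tiny throwaway tree with the same layout (contents are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    for name in ["python", "csharp", "javascript", "java"]:
        class_dir = pathlib.Path(tmp) / name
        class_dir.mkdir()
        for i in (1, 2):
            (class_dir / f"{i}.txt").write_text(f"question {i} about {name}")
    class_names, samples = infer_labels(tmp)

print(class_names)                       # ['csharp', 'java', 'javascript', 'python']
print([label for _, label in samples])   # [0, 0, 1, 1, 2, 2, 3, 3]
```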

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: [training](https://developers.google.com/machine-learning/glossary#training_set), [validation](https://developers.google.com/machine-learning/glossary#validation_set), and [test](https://developers.google.com/machine-learning/glossary#test-set).

The Stack Overflow dataset has already been divided into training and test sets, but it lacks a validation set.

Create a validation set using an 80:20 split of the training data by using `tf.keras.utils.text_dataset_from_directory` with `validation_split` set to `0.2` (i.e. 20%):

In [11]:
batch_size = 32
seed = 42

raw_train_ds = utils.text_dataset_from_directory(train_dir,
                                                 batch_size=batch_size,
                                                 validation_split=0.2,
                                                 subset='training',
                                                 seed=seed)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [12]:
type(raw_train_ds)

tensorflow.python.data.ops.dataset_ops.BatchDataset

<br>

As the previous cell output suggests, there are 8,000 examples in the training folder, of which you will use 80% (or 6,400) for training. You will learn in a moment that you can train a model by passing a `tf.data.Dataset` directly to `Model.fit`.

First, iterate over the dataset and print out a few examples, to get a feel for the data.

**Note**: To increase the difficulty of the classification problem, the dataset author replaced occurrences of the words *Python*, *CSharp*, *JavaScript*, or *Java* in the programming question with the word *blank*.

In [13]:
for text_batch, label_batch in raw_train_ds.take(1):
    print('shape of batch: ', text_batch.shape)
    for i in range(10):
        print("Question: ", text_batch.numpy()[i])
        print("Label:", label_batch.numpy()[i])
        print()

shape of batch:  (32,)
Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it on

<br>

The labels are `0`, `1`, `2` or `3`. To check which of these correspond to which string label, you can inspect the `class_names` property on the dataset:


In [14]:
for i, label in enumerate(raw_train_ds.class_names):
    print("Label", i, "corresponds to", label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python


<br>

Next, you will create a validation and a test set using `tf.keras.utils.text_dataset_from_directory`. You will use the remaining 1,600 questions from the training set for validation.

**Note**:  When using the `validation_split` and `subset` arguments of `tf.keras.utils.text_dataset_from_directory`, make sure to either specify a random seed or pass `shuffle=False`, so that the validation and training splits have no overlap.

In [15]:
# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(train_dir,
                                               batch_size=batch_size,
                                               validation_split=0.2,
                                               subset='validation',
                                               seed=seed)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
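
The note about seeds can be illustrated with a small standard-library sketch. It assumes (this is a simplification, not the Keras source) that the file list is shuffled with the given seed and the last 20% is held out for validation; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(num_files, validation_split, subset, seed):
    """Shuffle file indices with `seed`, then return the requested subset."""
    indices = list(range(num_files))
    random.Random(seed).shuffle(indices)
    num_val = int(validation_split * num_files)
    if subset == "training":
        return indices[:-num_val]
    return indices[-num_val:]


train = split_indices(8000, 0.2, "training", seed=42)
val = split_indices(8000, 0.2, "validation", seed=42)
print(len(train), len(val))      # 6400 1600
print(set(train) & set(val))     # set(): same seed, so the subsets are disjoint

# With a different seed the two calls shuffle differently, so the
# "validation" files can overlap the "training" files.
val_other_seed = split_indices(8000, 0.2, "validation", seed=7)
print(len(set(train) & set(val_other_seed)) > 0)
```

Passing the same seed to both calls is what guarantees that the two subsets partition the same shuffled order.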


In [16]:
test_dir = dataset_dir/'test'

raw_test_ds = utils.text_dataset_from_directory(test_dir,
                                                batch_size=batch_size
                                                )

Found 8000 files belonging to 4 classes.


<br>
<br>

### Prepare the dataset for training

Next, you will standardize, tokenize, and vectorize the data using the `tf.keras.layers.TextVectorization` layer.

- <font color=blue>**_Standardization_**</font> refers to preprocessing the text, typically to remove punctuation or HTML elements to simplify the dataset.
- <font color=blue>**_Tokenization_**</font> refers to splitting strings into tokens (for example, splitting a sentence into individual words by splitting on whitespace).
- <font color=blue>**_Vectorization_**</font> refers to converting tokens into numbers so they can be fed into a neural network.

All of these tasks can be accomplished with this layer. (You can learn more about each of these in the `tf.keras.layers.TextVectorization` API docs.)



<font color=maroon size=3>**Note that:**</font>

- The default standardization converts text to lowercase and removes punctuation (`standardize='lower_and_strip_punctuation'`).
- The default tokenizer splits on whitespace (`split='whitespace'`).
- The default vectorization mode is `'int'` (`output_mode='int'`). This outputs integer indices (one per token). This mode can be used to build models that take word order into account. You can also use other modes—like `'binary'`—to build [bag-of-words](https://developers.google.com/machine-learning/glossary#bag-of-words) models.
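
To make the default standardization and tokenization steps concrete, here is a plain-Python approximation. It is only a sketch: `standardize` and `split_on_whitespace` are hypothetical helpers, and `string.punctuation` is assumed to be close enough to the punctuation set the layer strips.

```python
import string


def standardize(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def split_on_whitespace(text):
    """Split standardized text into tokens on runs of whitespace."""
    return text.split()


print(split_on_whitespace(standardize("How do I sort a dictionary by value?")))
# ['how', 'do', 'i', 'sort', 'a', 'dictionary', 'by', 'value']

# Code snippets collapse into long single tokens once punctuation is removed,
# which is one reason many tokens in programming questions fall outside the vocabulary.
print(split_on_whitespace(standardize("document.createElement('div');")))
# ['documentcreateelementdiv']
```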

You will build two models to learn more about standardization, tokenization, and vectorization with `TextVectorization`:

- First, you will use the `'binary'` vectorization mode to build a bag-of-words model.
- Then, you will use the `'int'` mode with a 1D ConvNet.

In [17]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(max_tokens=VOCAB_SIZE,
                                           output_mode='binary')

In [18]:
binary_vectorize_layer

<keras.layers.preprocessing.text_vectorization.TextVectorization at 0x209cb3a5e20>

<br>

For the `'int'` mode, in addition to maximum vocabulary size, you need to set an explicit maximum sequence length (`MAX_SEQUENCE_LENGTH`), which will cause the layer to pad or truncate sequences to exactly `output_sequence_length` values:

In [19]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(max_tokens=VOCAB_SIZE,
                                        output_mode='int',
                                        output_sequence_length=MAX_SEQUENCE_LENGTH)

In [20]:
int_vectorize_layer

<keras.layers.preprocessing.text_vectorization.TextVectorization at 0x209e81da400>

<br>

<font size=3 color=maroon>Next, call `TextVectorization.adapt` to fit the state of the preprocessing layer to the dataset. This will cause the layer to build an index that maps strings to integers.

**Note:** It's important to only use your training data when calling `TextVectorization.adapt`, as using the test set would leak information.</font>

In [21]:
# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)

binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

Print the result of using these layers to preprocess data:

In [22]:
def binary_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    
    return binary_vectorize_layer(text), label

In [23]:
def int_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    
    return int_vectorize_layer(text), label

In [24]:
# Retrieve a batch (of 32 questions and labels) from the dataset.
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print("Question", first_question)
print()
print("Label", first_label)

Question tf.Tensor(b'"what is the difference between these two ways to create an element? var a = document.createelement(\'div\');..a.id = ""mydiv"";...and..var a = document.createelement(\'div\').id = ""mydiv"";...what is the difference between them such that the first one works and the second one doesn\'t?"\n', shape=(), dtype=string)

Label tf.Tensor(2, shape=(), dtype=int32)


<br>

In [25]:
print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 0. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)


<br>

In [26]:
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[ 55   6   2 410 211 229 121 895   4 124  32 245  43   5   1   1   5   1
    1   6   2 410 211 191 318  14   2  98  71 188   8   2 199  71 178   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0


<br>

<font size=3><font color=maroon>As shown above, `TextVectorization`'s `'binary'` mode returns an array denoting which tokens exist at least once in the input, while the `'int'` mode replaces each token by an integer, thus preserving their order.</font>

You can look up the token (string) that each integer corresponds to by calling `TextVectorization.get_vocabulary` on the layer:</font>

In [27]:
print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 --->  roman
313 --->  source
Vocabulary size: 10000
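
The difference between the `'int'` and `'binary'` output modes can be sketched in plain Python. This is a simplified model, not the layer's implementation: the helper names are hypothetical, the toy corpus is made up, and one vocabulary layout (index `0` for padding, index `1` for out-of-vocabulary tokens) is assumed for both modes, whereas the real layer may lay out its vocabulary differently in `'binary'` mode.

```python
from collections import Counter


def build_vocabulary(token_lists, max_tokens):
    """Reserve '' (padding) and '[UNK]', then keep the most frequent tokens."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda token: (-counts[token], token))
    return ["", "[UNK]"] + ranked[:max_tokens - 2]


def encode_int(tokens, vocabulary, output_sequence_length):
    """One index per token, truncated or zero-padded to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in tokens][:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))


def encode_binary(tokens, vocabulary):
    """One slot per vocabulary entry: 1.0 if the token occurs at least once."""
    index = {token: i for i, token in enumerate(vocabulary)}
    row = [0.0] * len(vocabulary)
    for token in tokens:
        row[index.get(token, 1)] = 1.0
    return row


corpus = [["what", "is", "the", "difference"],
          ["the", "first", "one", "works"],
          ["the", "difference", "is", "small"]]
vocabulary = build_vocabulary(corpus, max_tokens=6)
print(vocabulary)   # ['', '[UNK]', 'the', 'difference', 'is', 'first']

query = ["what", "is", "the", "difference"]
print(encode_int(query, vocabulary, output_sequence_length=6))  # [1, 4, 2, 3, 0, 0]
print(encode_binary(query, vocabulary))  # [0.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

The integer encoding keeps token order and needs padding, while the binary encoding discards order and always has one slot per vocabulary entry.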


<br>

You are nearly ready to train your model.

As a final preprocessing step, you will apply the `TextVectorization` layers you created earlier to the training, validation, and test sets:

In [28]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

<br>
<br>

### Configure the dataset for performance

Use the following two methods when loading data so that I/O does not become a bottleneck:

- `Dataset.cache` keeps data in memory after it's loaded off disk. This will ensure the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
- `Dataset.prefetch` overlaps data preprocessing and model execution while training.

You can learn more about both methods, as well as how to cache data to disk in the *Prefetching* section of the [Better performance with the tf.data API](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/data_performance.ipynb) guide.

In [29]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [30]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)
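
For intuition about what `Dataset.cache` and `Dataset.prefetch` buy you, the sketch below mimics both ideas with the standard library. It is a conceptual model only: `Cached`, `prefetch`, and `CountingSource` are hypothetical names, and `tf.data` implements these steps very differently internally.

```python
import queue
import threading


class Cached:
    """Replay elements from memory after the first complete pass."""

    def __init__(self, make_iterator):
        self._make_iterator = make_iterator
        self._items = None

    def __iter__(self):
        if self._items is not None:
            yield from self._items
            return
        items = []
        for item in self._make_iterator():
            items.append(item)
            yield item
        self._items = items  # Only set once the source has been fully read.


def prefetch(iterable, buffer_size=2):
    """Produce up to `buffer_size` items ahead of the consumer in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()

    def producer():
        for item in iterable:
            buffer.put(item)
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


class CountingSource:
    """Stand-in for an expensive read from disk; counts how many items it loads."""

    def __init__(self):
        self.loads = 0

    def __call__(self):
        for i in range(5):
            self.loads += 1
            yield i * i


source = CountingSource()
dataset = Cached(source)
first_epoch = list(prefetch(dataset))
second_epoch = list(prefetch(dataset))
print(first_epoch, second_epoch)  # [0, 1, 4, 9, 16] [0, 1, 4, 9, 16]
print(source.loads)               # 5: the second epoch was served from the cache
```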

<br>
<br>

### Train the model

It's time to create your neural network.

For the `'binary'` vectorized data, define a simple bag-of-words linear model, then configure and train it:

In [31]:
binary_model = tf.keras.Sequential([layers.Dense(4)])

binary_model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
                     optimizer='adam',
                     metrics=['accuracy'])

history = binary_model.fit(binary_train_ds, validation_data=binary_val_ds, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
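
The `Dense(4)` layer outputs one unnormalized score (logit) per class, and `from_logits=True` tells the loss to apply the softmax itself. A minimal plain-Python sketch of that per-example computation follows; the logits are made-up numbers, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, logits):
    """Negative log-probability assigned to the integer class `label`."""
    return -math.log(softmax(logits)[label])


logits = [0.5, 2.0, -1.0, 0.1]   # made-up scores for csharp, java, javascript, python
loss_if_java = sparse_categorical_crossentropy(1, logits)
loss_if_javascript = sparse_categorical_crossentropy(2, logits)

# The loss is small when the true class has the highest logit, large otherwise.
print(loss_if_java < loss_if_javascript)   # True
```

"Sparse" here refers to the labels being plain integers (`0` to `3`) rather than one-hot vectors.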


<br>

Next, you will use the `'int'` vectorized layer to build a 1D ConvNet:

In [32]:
help(layers.Embedding)

Help on class Embedding in module keras.layers.embeddings:

class Embedding(keras.engine.base_layer.Layer)
 |  Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs)
 |  
 |  Turns positive integers (indexes) into dense vectors of fixed size.
 |  
 |  e.g. `[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]`
 |  
 |  This layer can only be used on positive integer inputs of a fixed range. The
 |  `tf.keras.layers.TextVectorization`, `tf.keras.layers.StringLookup`,
 |  and `tf.keras.layers.IntegerLookup` preprocessing layers can help prepare
 |  inputs for an `Embedding` layer.
 |  
 |  This layer accepts `tf.Tensor` and `tf.RaggedTensor` inputs. It cannot be
 |  called with `tf.SparseTensor` input.
 |  
 |  Example:
 |  
 |  >>> model = tf.keras.Sequential()
 |  >>> model.add(tf.keras.layers.Embedding(1000, 64, input_length=10))
 |  >>> # The model will t

In [33]:
def create_model(vocab_size, num_labels):
    model = tf.keras.Sequential([layers.Embedding(vocab_size, 64, mask_zero=True),
                                 layers.Conv1D(64, 5, 
                                               padding="valid", 
                                               activation="relu", 
                                               strides=2),
                                 layers.GlobalMaxPooling1D(),
                                 layers.Dense(num_labels)
                                ])
    return model

In [34]:
# `vocab_size` is `VOCAB_SIZE + 1` since `0` is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4)

int_model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer='adam',
                  metrics=['accuracy']
                  )

history = int_model.fit(int_train_ds, validation_data=int_val_ds, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<br>

Compare the two models:

In [35]:
print("Linear model on binary vectorized data:")
print(binary_model.summary())

Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 4)                 40004     
                                                                 
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None


<br>

In [36]:
print("ConvNet model on int vectorized data:")
print(int_model.summary())

ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          640064    
                                                                 
 conv1d (Conv1D)             (None, None, 64)          20544     
                                                                 
 global_max_pooling1d (Globa  (None, 64)               0         
 lMaxPooling1D)                                                  
                                                                 
 dense_1 (Dense)             (None, 4)                 260       
                                                                 
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None
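
The parameter counts in the two summaries can be checked by hand. The arithmetic below uses only the layer sizes from the code above (vocabulary of 10,000, embedding width 64, 64 filters with kernel size 5, 4 labels); the constant names are introduced here for illustration.

```python
VOCAB_SIZE = 10000
EMBEDDING_DIM = 64
NUM_FILTERS = 64
KERNEL_SIZE = 5
NUM_LABELS = 4

# Linear model: one weight per (vocabulary slot, label) plus one bias per label.
binary_dense = VOCAB_SIZE * NUM_LABELS + NUM_LABELS

# ConvNet: embedding table (no bias), convolution kernels plus biases, output layer.
embedding = (VOCAB_SIZE + 1) * EMBEDDING_DIM
conv1d = KERNEL_SIZE * EMBEDDING_DIM * NUM_FILTERS + NUM_FILTERS
int_dense = NUM_FILTERS * NUM_LABELS + NUM_LABELS

print(binary_dense)                     # 40004
print(embedding, conv1d, int_dense)     # 640064 20544 260
print(embedding + conv1d + int_dense)   # 660868
```

Almost all of the ConvNet's parameters sit in the embedding table, which is why it is roughly 16 times larger than the linear model.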


<br>

Evaluate both models on the test data:

In [37]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy: {:2.2%}".format(int_accuracy))

Binary model accuracy: 81.44%
Int model accuracy: 80.26%


<br>

<font color=maroon size=3>**Note:** This example dataset represents a rather simple classification problem. More complex datasets and problems bring out subtle but significant differences in preprocessing strategies and model architectures. Be sure to try out different hyperparameters and epochs to compare various approaches.</font>

<br>
<br>

### Export the model

In the code above, you applied `tf.keras.layers.TextVectorization` to the dataset before feeding text to the model. If you want to make your model capable of processing raw strings (for example, to simplify deploying it), you can include the `TextVectorization` layer inside your model.

To do so, you can create a new model using the weights you have just trained:

In [38]:
export_model = tf.keras.Sequential([binary_vectorize_layer, 
                                    binary_model,
                                    layers.Activation('sigmoid')
                                   ])

export_model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=False),
                     optimizer='adam',
                     metrics=['accuracy']
                    )

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Accuracy: 81.44%


<br>

Now, your model can take raw strings as input and predict a score for each label using `Model.predict`. Define a function to find the label with the maximum score:

In [39]:
def get_string_labels(predicted_scores_batch):
    predicted_int_labels = tf.math.argmax(predicted_scores_batch, axis=1)
    predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
    
    return predicted_labels
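
For readers unfamiliar with `tf.math.argmax` and `tf.gather`, the same lookup can be written in plain Python. This is an equivalent sketch on lists, with made-up scores and hypothetical helper names, not a replacement for the tensor version.

```python
def argmax(scores):
    """Index of the largest score in a single row."""
    return max(range(len(scores)), key=lambda i: scores[i])


def get_string_labels_py(scores_batch, class_names):
    """Pick the best-scoring class per row, then look up its name."""
    return [class_names[argmax(scores)] for scores in scores_batch]


class_names = ["csharp", "java", "javascript", "python"]
scores_batch = [[0.1, 0.2, 0.1, 0.9],   # made-up scores, one row per question
                [0.3, 0.8, 0.4, 0.2]]
print(get_string_labels_py(scores_batch, class_names))   # ['python', 'java']
```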

In [40]:
help(tf.gather)

Help on function gather_v2 in module tensorflow.python.ops.array_ops:

gather_v2(params, indices, validate_indices=None, axis=None, batch_dims=0, name=None)
    Gather slices from params axis `axis` according to indices. (deprecated arguments)
    
    Instructions for updating:
    The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
    
    Gather slices from `params` axis `axis` according to `indices`.  `indices`
    must be an integer tensor of any dimension (often 1-D).
    
    `Tensor.__getitem__` works for scalars, `tf.newaxis`, and
    [python slices](https://numpy.org/doc/stable/reference/arrays.indexing.html#basic-slicing-and-indexing)
    
    `tf.gather` extends indexing to handle tensors of indices.
    
    In the simplest case it's identical to scalar indexing:
    
    >>> params = tf.constant(['p0', 'p1', 'p2', 'p3', 'p4', 'p5'])
    >>> params[3].numpy()
    b'p3'
    >>> tf.gather(params, 3).numpy()
    b'p3

<br>
<br>

### Run inference on new data

In [41]:
inputs = [ "how do I extract keys from a dict into a list?",      # 'python'
           "debug public static void main(string[] args) {...}",  # 'java'
         ]

predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)

for question, label in zip(inputs, predicted_labels):
    print("Question: ", question)
    print("Predicted label: ", label.numpy())
    print()

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'

Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'



Including the text preprocessing logic inside your model lets you export a model for production that simplifies deployment and reduces the potential for [training/serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew).
<br>
<br>

<font size=3 color=maroon>There is a performance difference to keep in mind when choosing where to apply `tf.keras.layers.TextVectorization`.
- Using it outside of your model enables asynchronous CPU processing and buffering of your data when training on GPU. If you're training your model on the GPU, you probably want this option to get the best performance while developing your model.
- Including the `TextVectorization` layer inside your model is the better choice once you're ready to prepare for deployment.</font>
<br>

Visit the [Save and load models](../keras/save_and_load.ipynb) tutorial to learn more about saving models.

<br>
<br>
<br>

## Example 2: Predict the author of Iliad translations

The following provides an example of using `tf.data.TextLineDataset` to load examples from text files, and [TensorFlow Text](https://www.tensorflow.org/text) to preprocess the data. You will use three different English translations of the same work, Homer's Iliad, and train a model to identify the translator given a single line of text.

### Download and explore the dataset

The texts of the three translations are by:

- [William Cowper](https://en.wikipedia.org/wiki/William_Cowper): [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt)
- [Edward, Earl of Derby](https://en.wikipedia.org/wiki/Edward_Smith-Stanley,_14th_Earl_of_Derby): [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt)
- [Samuel Butler](https://en.wikipedia.org/wiki/Samuel_Butler_%28novelist%29): [text](https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt)

The text files used in this tutorial have undergone some typical preprocessing tasks like removing document header and footer, line numbers and chapter titles.

Download these lightly munged files locally:

In [42]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = utils.get_file(name,
                              origin=DIRECTORY_URL + name,
                              cache_dir='D:/KeepStudy/0_Coding/0_dataset',
                              cache_subdir='Iliad_translations')

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

[WindowsPath('D:/KeepStudy/0_Coding/0_dataset/Iliad_translations/butler.txt'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/Iliad_translations/cowper.txt'),
 WindowsPath('D:/KeepStudy/0_Coding/0_dataset/Iliad_translations/derby.txt')]

<br>

### Load the dataset

Previously, with `tf.keras.utils.text_dataset_from_directory` all contents of a file were treated as a single example. Here, you will use `tf.data.TextLineDataset`, which is designed to create a `tf.data.Dataset` from a text file where each example is a line of text from the original file. `TextLineDataset` is useful for text data that is primarily line-based (for example, poetry or error logs).

Iterate through these files, loading each one into its own dataset. Each example needs to be individually labeled, so use `Dataset.map` to apply a labeler function to each one. This will iterate over every example in the dataset, returning `(example, label)` pairs.

In [43]:
def labeler(example, index):
    return (example, tf.cast(index, tf.int64))

In [44]:
labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

In [45]:
labeled_data_sets

[<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 <MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>,
 <MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>]

<br>

<font color=maroon size=3>Next, you'll combine these labeled datasets into a single dataset using `Dataset.concatenate`, and shuffle it with `Dataset.shuffle`:</font>

In [46]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

In [47]:
all_labeled_data = labeled_data_sets[0]
all_labeled_data

<MapDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

In [48]:
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

In [49]:
all_labeled_data

<ShuffleDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>
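
The label, concatenate, and shuffle steps can be mimicked in plain Python to show how a buffer-based shuffle behaves. This is a conceptual sketch: `buffered_shuffle` is a hypothetical helper, the lines are placeholders rather than real text from the translations, and it only models the documented idea that `Dataset.shuffle` samples from a fixed-size buffer.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items in random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)   # Fill the buffer first.
            continue
        # Buffer is full: emit a random element and replace it with the new item.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    rng.shuffle(buffer)           # Drain whatever is left in random order.
    yield from buffer


file_names = ["cowper.txt", "derby.txt", "butler.txt"]
# Placeholder lines; label each line with the index of the file it came from.
labeled = [(f"{name} line {n}", label)
           for label, name in enumerate(file_names)
           for n in range(4)]

shuffled = list(buffered_shuffle(labeled, buffer_size=len(labeled), seed=0))
print(sorted(shuffled) == sorted(labeled))   # True: same examples, new order

# A buffer of size 1 cannot reorder anything.
print(list(buffered_shuffle(labeled, buffer_size=1)) == labeled)   # True
```

A buffer at least as large as the dataset gives a uniform shuffle; a smaller buffer only mixes examples that are close together, which matters here because the concatenated dataset is grouped by translator.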

<br>

Print out a few examples as before. The dataset hasn't been batched yet, hence each entry in `all_labeled_data` corresponds to one data point:

In [50]:
for text, label in all_labeled_data.take(10):
    print(text)
    print("Sentence: ", text.numpy())
    print("Label:", label.numpy())
    print(text.shape)
    print()
    
# The labels in the output below are unordered because the dataset has been shuffled.

tf.Tensor(b'To Phthia, her whom thou shalt most approve.', shape=(), dtype=string)
Sentence:  b'To Phthia, her whom thou shalt most approve.'
Label: 0
()

tf.Tensor(b'Their costly raiment, while the land had rest,', shape=(), dtype=string)
Sentence:  b'Their costly raiment, while the land had rest,'
Label: 0
()

tf.Tensor(b'Right on, but smitten by some dauntless youth', shape=(), dtype=string)
Sentence:  b'Right on, but smitten by some dauntless youth'
Label: 0
()

tf.Tensor(b'Thy stronger far, and dearer to the Gods?', shape=(), dtype=string)
Sentence:  b'Thy stronger far, and dearer to the Gods?'
Label: 1
()

tf.Tensor(b'Withheld his aid; but close beside her foot', shape=(), dtype=string)
Sentence:  b'Withheld his aid; but close beside her foot'
Label: 1
()

tf.Tensor(b'towards Troy, for he did not think that any of the immortals would go', shape=(), dtype=string)
Sentence:  b'towards Troy, for he did not think that any of the immortals would go'
Label: 2
()

tf.Tensor(b'Antilochus

<br>
<br>

### Prepare the dataset for training

Instead of using `tf.keras.layers.TextVectorization` to preprocess the text dataset, you will now use the TensorFlow Text APIs to standardize and tokenize the data, build a vocabulary, and use `tf.lookup.StaticVocabularyTable` to map tokens to the integers that are fed to the model. (Learn more about [TensorFlow Text](https://www.tensorflow.org/text).)

Define a function to convert the text to lower-case and tokenize it:

- TensorFlow Text provides various tokenizers. In this example, you will use the `text.UnicodeScriptTokenizer` to tokenize the dataset.
- You will use `Dataset.map` to apply the tokenization to the dataset.

In [51]:
tokenizer = tf_text.UnicodeScriptTokenizer()

In [52]:
def tokenize(text, unused_label):
    lower_case = tf_text.case_fold_utf8(text)
    
    return tokenizer.tokenize(lower_case)

In [53]:
tokenized_ds = all_labeled_data.map(tokenize)
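
For readers without TensorFlow Text at hand, the following rough sketch imitates the lowercase-then-split step with the standard library. It is only an approximation that happens to work for simple ASCII sentences like the ones in this dataset; it is not the algorithm used by `text.UnicodeScriptTokenizer`, and the names `TOKEN_PATTERN` and `rough_tokenize` are made up for this illustration.

```python
import re

# Runs of word characters, or runs of non-space punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def rough_tokenize(text):
    # str.casefold() is Python's aggressive lowercasing; it stands in for
    # tf_text.case_fold_utf8 in this sketch.
    return TOKEN_PATTERN.findall(text.casefold())


tokens = rough_tokenize("To Phthia, her whom thou shalt most approve.")
print(tokens)
# ['to', 'phthia', ',', 'her', 'whom', 'thou', 'shalt', 'most', 'approve', '.']
```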

<br>

You can iterate over the dataset and print out a few tokenized examples:

In [54]:
for text_batch in tokenized_ds.take(5):
    print("Tokens: ", text_batch.numpy())
    print()

Tokens:  [b'to' b'phthia' b',' b'her' b'whom' b'thou' b'shalt' b'most' b'approve'
 b'.']

Tokens:  [b'their' b'costly' b'raiment' b',' b'while' b'the' b'land' b'had' b'rest'
 b',']

Tokens:  [b'right' b'on' b',' b'but' b'smitten' b'by' b'some' b'dauntless' b'youth']

Tokens:  [b'thy' b'stronger' b'far' b',' b'and' b'dearer' b'to' b'the' b'gods' b'?']

Tokens:  [b'withheld' b'his' b'aid' b';' b'but' b'close' b'beside' b'her' b'foot']



<br>

Next, you will build a vocabulary by sorting tokens by frequency and keeping the top `VOCAB_SIZE` tokens:

In [55]:
tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(int)

for toks in tokenized_ds.as_numpy_iterator():
    for tok in toks:
        vocab_dict[tok] += 1

vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]    # VOCAB_SIZE=10000
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries:", vocab[:5])

Vocab size:  10000
First five vocab entries: [b',', b'the', b'and', b"'", b'of']
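
The same count-sort-truncate idea can be sketched with `collections.Counter` on a toy corpus. The `tokenized_lines` data and the tiny `VOCAB_SIZE` are hypothetical and chosen only to keep the example readable.

```python
from collections import Counter

# Hypothetical, already tokenized lines.
tokenized_lines = [
    ["the", "wrath", "of", "achilles", ","],
    ["the", "son", "of", "peleus", ","],
    ["sing", ",", "goddess"],
]
VOCAB_SIZE = 3  # hypothetical; the notebook uses 10000

# Count every token across all lines.
counts = Counter(tok for line in tokenized_lines for tok in line)

# Keep the VOCAB_SIZE most frequent tokens, most frequent first.
vocab = [tok for tok, _ in counts.most_common(VOCAB_SIZE)]
print(counts[","], counts["the"], counts["of"])  # 3 2 2
print(vocab)
```

Tokens that fall outside the top `VOCAB_SIZE` entries are the ones that later need an out-of-vocabulary id.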


<br>

<font size=3>To convert the tokens into integers, use the `vocab` list to create a `tf.lookup.StaticVocabularyTable`. You will map tokens to integers from `2` to `vocab_size + 1`, inclusive. <font color=maroon>As with the `TextVectorization` layer, `0` is reserved to denote padding and `1` is reserved to denote an out-of-vocabulary (OOV) token.</font> </font>

In [56]:
keys = vocab
values = range(2, len(vocab) + 2)  # Reserve `0` for padding, `1` for OOV tokens.

init = tf.lookup.KeyValueTensorInitializer(keys,
                                           values,
                                           key_dtype=tf.string,
                                           value_dtype=tf.int64)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)
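
The id convention described above (padding at `0`, OOV at `1`, real tokens from `2` upward) can be sketched with a plain dictionary. This illustrates the intended mapping only, not how the lookup table is implemented; the three-token `vocab` and the names `PAD_ID`, `OOV_ID`, and `lookup` are hypothetical.

```python
vocab = [",", "the", "and"]  # hypothetical vocabulary, most frequent first

PAD_ID = 0  # reserved for padding
OOV_ID = 1  # reserved for out-of-vocabulary tokens in this sketch

# Real tokens start at id 2, in vocabulary order.
token_to_id = {token: index for index, token in enumerate(vocab, start=2)}


def lookup(tokens):
    # Unknown tokens fall back to the OOV id.
    return [token_to_id.get(token, OOV_ID) for token in tokens]


ids = lookup(["the", "wrath", ",", "and"])
print(ids)  # [3, 1, 2, 4]

# Two extra ids (padding and OOV) on top of the vocabulary itself.
vocab_size = len(vocab) + 2
print(vocab_size)  # 5
```

One caveat worth checking in your own environment: the `tf.lookup.StaticVocabularyTable` documentation describes OOV bucket ids as starting at the size of the underlying table, so unknown tokens may not land on id `1` the way this plain-dictionary sketch assumes.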

<br>

Finally, define a function to standardize, tokenize and vectorize the dataset using the tokenizer and lookup table:

In [57]:
def preprocess_text(text, label):
    standardized = tf_text.case_fold_utf8(text)
    tokenized = tokenizer.tokenize(standardized)
    vectorized = vocab_table.lookup(tokenized)
    
    return vectorized, label

<br>

You can try this on a single example to print the output:

In [58]:
example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence:  b'To Phthia, her whom thou shalt most approve.'
Vectorized sentence:  [   8 1033    2   50   65   47  469  260 3497    7]


<br>

Now run the preprocess function on the dataset using `Dataset.map`:

In [59]:
all_encoded_data = all_labeled_data.map(preprocess_text)

<br>
<br>

### Split the dataset into training and validation sets

The Keras `TextVectorization` layer also batches and pads the vectorized data. Padding is required because the examples inside of a batch need to be the same size and shape, but the examples in these datasets are not all the same size—each line of text has a different number of words.
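
To make the padding requirement concrete, here is a small pure-Python sketch of padded batching, assuming each batch is padded with `0` to the length of its longest sequence. The helper name `padded_batches` and the sample id sequences are illustrative.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]


# Hypothetical vectorized lines of different lengths.
sequences = [[8, 1033, 2], [50], [65, 47, 469, 260, 3497], [7, 7]]

batches = list(padded_batches(sequences, batch_size=2))
for batch in batches:
    print(batch)
# [[8, 1033, 2], [50, 0, 0]]
# [[65, 47, 469, 260, 3497], [7, 7, 0, 0, 0]]
```

Each batch is rectangular on its own, but different batches can have different widths.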

`tf.data.Dataset` supports splitting and padded-batching datasets:

In [60]:
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)
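
The split relies on `skip` and `take` seeing the examples in the same order, which is presumably why the earlier shuffle was created with `reshuffle_each_iteration=False`. The plain-list sketch below illustrates the idea; `fixed_order`, the helper functions, and the small `VALIDATION_SIZE` are hypothetical.

```python
VALIDATION_SIZE = 3  # hypothetical, much smaller than the notebook's value


def take(order, n):
    return order[:n]


def skip(order, n):
    return order[n:]


# A hypothetical shuffled order of ten example ids that stays the same on
# every pass, as with a shuffle that is not redone per iteration.
fixed_order = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]
validation = take(fixed_order, VALIDATION_SIZE)
train = skip(fixed_order, VALIDATION_SIZE)
print(set(validation) & set(train))  # set(): the two splits are disjoint

# A different order on the second pass (here simply reversed, so the result
# is deterministic) lets validation examples leak into the training split.
second_pass = list(reversed(fixed_order))
leaky_train = skip(second_pass, VALIDATION_SIZE)
print(sorted(set(validation) & set(leaky_train)))  # [2, 7, 9]
```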

In [61]:
train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

<br>

Now, `validation_data` and `train_data` are not collections of (`example, label`) pairs, but collections of batches. Each batch is a pair of (*many examples*, *many labels*) represented as arrays.

To illustrate this:

In [62]:
sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Text batch shape:  (64, 18)
Label batch shape:  (64,)
First text example:  tf.Tensor(
[   8 1033    2   50   65   47  469  260 3497    7    0    0    0    0
    0    0    0    0], shape=(18,), dtype=int64)
First label example:  tf.Tensor(0, shape=(), dtype=int64)


<br>

Since you use `0` for padding and `1` for out-of-vocabulary (OOV) tokens, the vocabulary size has increased by two:

In [63]:
vocab_size += 2

<br>

Configure the datasets for better performance as before:

In [64]:
train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

<br>
<br>

### Train the model

You can train a model on this dataset as before:

In [65]:
model = create_model(vocab_size=vocab_size, num_labels=3)

model.compile(optimizer='adam',
              loss=losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy']
             )

history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [66]:
loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.4299073815345764
Accuracy: 84.14%


<br>
<br>

### Export the model

To make the model capable of taking raw strings as input, you will create a Keras `TextVectorization` layer that performs the same steps as your custom preprocessing function. <font color=maroon size=3>Since you have already **trained a vocabulary**, you can call `TextVectorization.set_vocabulary` with it instead of calling `TextVectorization.adapt`, which would train a new vocabulary.</font>

In [67]:
preprocess_layer = TextVectorization(max_tokens=vocab_size,
                                     standardize=tf_text.case_fold_utf8,
                                     split=tokenizer.tokenize,
                                     output_mode='int',
                                     output_sequence_length=MAX_SEQUENCE_LENGTH
                                    )

preprocess_layer.set_vocabulary(vocab)
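
Unlike padded batching, which pads to the longest example in each batch, `output_sequence_length` gives every example the same fixed length by padding or truncating. A minimal sketch of that behavior, with a hypothetical helper name and small lengths chosen for readability:

```python
def pad_or_truncate(ids, length, pad_value=0):
    """Return exactly `length` ids: pad short inputs, cut long ones."""
    return (ids + [pad_value] * length)[:length]


short = pad_or_truncate([8, 1033, 2], length=5)
long_ = pad_or_truncate([8, 1033, 2, 50, 65, 47], length=4)
print(short)  # [8, 1033, 2, 0, 0]
print(long_)  # [8, 1033, 2, 50]
```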

In [68]:
export_model = tf.keras.Sequential([preprocess_layer, 
                                    model,
                                    layers.Activation('sigmoid')
                                   ])

export_model.compile(loss=losses.SparseCategoricalCrossentropy(from_logits=False),
                     optimizer='adam',
                     metrics=['accuracy']
                    )

In [69]:
# Create a test dataset of raw strings.
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)

loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.5588675737380981
Accuracy: 78.74%


<br>

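The exported model on the raw validation strings should score close to the model on the encoded validation set. In this run the results differ: loss 0.559 and accuracy 78.74% here, against loss 0.430 and accuracy 84.14% above. Some of the loss difference may come from the export wrapper, which applies a `sigmoid` activation and evaluates with `from_logits=False`; the accuracy gap suggests that the `TextVectorization` layer and the custom preprocessing function do not produce identical inputs for every example.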

<br>
<br>

### Run inference on new data

In [70]:
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]

predicted_scores = export_model.predict(inputs)
predicted_labels = tf.math.argmax(predicted_scores, axis=1)

for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label.numpy())
    print()

Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1

Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2

Question:  And with loud clangor of his arms he fell.
Predicted label:  0



<br>
<br>
<br>

## Download more datasets using TensorFlow Datasets (TFDS)

You can download many more datasets from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview).

In this example, you will use the [IMDB Large Movie Review dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews) to train a model for sentiment classification:

In [71]:
# help(tfds.load)

In [72]:
# Training set.
train_ds = tfds.load('imdb_reviews',
                     split='train[:80%]',
                     batch_size=BATCH_SIZE,
                     shuffle_files=True,
                     as_supervised=True,
                     data_dir="D:/KeepStudy/0_Coding/0_dataset/tensorflow_datasets/"
                    )

In [73]:
train_ds

<_OptionsDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

In [74]:
# Validation set.
val_ds = tfds.load('imdb_reviews',
                   split='train[80%:]',
                   batch_size=BATCH_SIZE,
                   shuffle_files=True,
                   as_supervised=True,
                   data_dir="D:/KeepStudy/0_Coding/0_dataset/tensorflow_datasets/"
                  )

<br>

Print a few examples:

In [75]:
for review_batch, label_batch in val_ds.take(1):
    for i in range(5):
        print("Review: ", review_batch[i].numpy())
        print("Label: ", label_batch[i].numpy())
        print()

Review:  b"Instead, go to the zoo, buy some peanuts and feed 'em to the monkeys. Monkeys are funny. People with amnesia who don't say much, just sit there with vacant eyes are not all that funny.<br /><br />Black comedy? There isn't a black person in it, and there isn't one funny thing in it either.<br /><br />Walmart buys these things up somehow and puts them on their dollar rack. It's labeled Unrated. I think they took out the topless scene. They may have taken out other stuff too, who knows? All we know is that whatever they took out, isn't there any more.<br /><br />The acting seemed OK to me. There's a lot of unfathomables tho. It's supposed to be a city? It's supposed to be a big lake? If it's so hot in the church people are fanning themselves, why are they all wearing coats?"
Label:  0

Review:  b'Well, was Morgan Freeman any more unusual as God than George Burns? This film sure was better than that bore, "Oh, God". I was totally engrossed and LMAO all the way through. Carrey wa

<br>

You can now preprocess the data and train a model as before.

Note: You will use `tf.keras.losses.BinaryCrossentropy` instead of `tf.keras.losses.SparseCategoricalCrossentropy` for your model, since this is a binary classification problem.
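
For reference, binary cross-entropy computed directly from logits can be sketched in a few lines of pure Python. The sketch uses the numerically stable form `max(x, 0) - x*z + log(1 + exp(-|x|))` for a logit `x` and a label `z`, and compares it with the textbook definition; the function names and sample values are illustrative, and this is not the Keras implementation.

```python
import math


def binary_crossentropy_from_logits(labels, logits):
    """Mean binary cross-entropy, computed stably from raw logits."""
    losses = [
        max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
        for z, x in zip(labels, logits)
    ]
    return sum(losses) / len(losses)


def naive_binary_crossentropy(labels, logits):
    """Textbook form: apply the sigmoid first, then take logs."""
    total = 0.0
    for z, x in zip(labels, logits):
        p = 1.0 / (1.0 + math.exp(-x))
        total += -(z * math.log(p) + (1.0 - z) * math.log(1.0 - p))
    return total / len(labels)


labels = [1.0, 0.0, 1.0]
logits = [2.0, -1.0, -3.0]
stable = binary_crossentropy_from_logits(labels, logits)
naive = naive_binary_crossentropy(labels, logits)
print(abs(stable - naive) < 1e-9)  # True: both forms agree on these values
```

A logit of `0` corresponds to a predicted probability of 0.5, so its loss is `log(2)` whichever label is given.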

### Prepare the dataset for training

In [76]:
vectorize_layer = TextVectorization(max_tokens=VOCAB_SIZE,
                                    output_mode='int',
                                    output_sequence_length=MAX_SEQUENCE_LENGTH
                                   )

# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

In [77]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    
    return vectorize_layer(text), label

In [78]:
train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

In [79]:
# Configure datasets for performance as before.
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

<br>

### Create, configure and train the model

In [80]:
model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, None, 64)          640064    
                                                                 
 conv1d_2 (Conv1D)           (None, None, 64)          20544     
                                                                 
 global_max_pooling1d_2 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_3 (Dense)             (None, 1)                 65        
                                                                 
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________


<br>

In [81]:
model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=['accuracy']
              )

In [82]:
history = model.fit(train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<br>

In [83]:
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.3255687355995178
Accuracy: 86.26%


<br>
<br>

### Export the model

In [84]:
export_model = tf.keras.Sequential([vectorize_layer, model,
                                    layers.Activation('sigmoid')
                                   ])

# The model has a single sigmoid output for binary labels, so use binary cross-entropy.
export_model.compile(loss=losses.BinaryCrossentropy(from_logits=False),
                     optimizer='adam',
                     metrics=['accuracy'])

In [85]:
# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]

predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]

for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label)
    print()

Question:  This is a fantastic movie.
Predicted label:  1

Question:  This is a bad movie.
Predicted label:  0

Question:  This movie was so bad that it was good.
Predicted label:  0

Question:  I will never say yes to watching this movie.
Predicted label:  0



<br>
<br>
<br>

## Conclusion

This tutorial demonstrated several ways to load and preprocess text. As a next step, you can explore additional text preprocessing [TensorFlow Text](https://www.tensorflow.org/text) tutorials, such as:

- [BERT Preprocessing with TF Text](https://www.tensorflow.org/text/guide/bert_preprocessing_guide)
- [Tokenizing with TF Text](https://www.tensorflow.org/text/guide/tokenizers)
- [Subword tokenizers](https://www.tensorflow.org/text/guide/subwords_tokenizer)

You can also find new datasets on [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview). And, to learn more about `tf.data`, check out the guide on [building input pipelines](https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/guide/data.ipynb).

<br>
<br>
<br>

```python
# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
```

<br>
<br>
<br>