# [Loading text](https://www.tensorflow.org/tutorials/load_data/text)

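There are two ways to load and preprocess text data:

1. Use Keras utilities and the layers Keras provides
2. Use lower-level utilities, such as loading with `tf.data.TextLineDataset` and preprocessing with `tf.text`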

In [1]:
import collections
import pathlib
import re
import string

import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import preprocessing
from tensorflow.keras import utils
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

import tensorflow_datasets as tfds
import tensorflow_text as tf_text

## Example 1

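This example uses a dataset of programming questions from Stack Overflow. Each question is labeled with exactly one tag (`Python`, `CSharp`, `JavaScript`, or `Java`). The goal is to build a model that predicts this tag, which is a multi-class classification problem.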

In [2]:
data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz'
dataset = utils.get_file(
    'stack_overflow_16k.tar.gz',
    data_url,
    untar=True,
    cache_dir='stack_overflow',
    cache_subdir=''
)
dataset_dir = pathlib.Path(dataset).parent

In [3]:
list(dataset_dir.iterdir())

[PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/README.md'),
 PosixPath('/tmp/.keras/train'),
 PosixPath('/tmp/.keras/stack_overflow_16k.tar.gz.tar.gz')]

In [4]:
train_dir = dataset_dir/'train'
list(train_dir.iterdir())

[PosixPath('/tmp/.keras/train/python'),
 PosixPath('/tmp/.keras/train/java'),
 PosixPath('/tmp/.keras/train/csharp'),
 PosixPath('/tmp/.keras/train/javascript')]

Each of these directories under `train` contains many text files, and each file is one Stack Overflow question.

In [5]:
sample_file = train_dir/'python/1755.txt'
with open(sample_file) as f:
    print(f.read())

why does this blank program print true x=true.def stupid():.    x=false.stupid().print x



Load the dataset and preprocess it. `preprocessing.text_dataset_from_directory` creates a `tf.data.Dataset` and expects a directory structure like the following:
```
train/
...csharp/
......1.txt
......2.txt
...java/
......1.txt
......2.txt
...javascript/
......1.txt
......2.txt
...python/
......1.txt
......2.txt
```
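
As a rough, TensorFlow-free sketch of what this layout implies, the snippet below walks a directory of class subfolders and returns one `(text, label)` pair per file, numbering the classes in sorted order. The helper name `read_labeled_files` and the temporary files are illustrative assumptions; this is not how Keras implements the loader.

```python
import pathlib
import tempfile


def read_labeled_files(root):
    """Return (class_names, examples) for a root/<class_name>/<file>.txt layout."""
    root = pathlib.Path(root)
    # Class names are the subdirectory names, taken in sorted order.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    examples = []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            # The whole content of a file is treated as one example.
            examples.append((path.read_text(encoding="utf-8"), label))
    return class_names, examples


# Build a tiny stand-in for the train/ directory (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("python", "print x"), ("java", "public static void main")]:
        class_dir = pathlib.Path(tmp) / "train" / name
        class_dir.mkdir(parents=True)
        (class_dir / "1.txt").write_text(text, encoding="utf-8")
    class_names, examples = read_labeled_files(pathlib.Path(tmp) / "train")

print(class_names)
print(examples)
```

Sorting the class names is what makes the integer labels reproducible across runs in this sketch.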

Create separate training, validation, and test datasets.

In [6]:
batch_size = 32
seed = 42

# Training dataset
raw_train_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [7]:
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(10):
        print("Question: ", text_batch.numpy()[i])
        print("Label: ", label_batch.numpy()[i])

Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default con

In [8]:
# See which integer label corresponds to which string label
for i, label in enumerate(raw_train_ds.class_names):
    print("Label", i, 'corresponds to', label)

Label 0 corresponds to csharp
Label 1 corresponds to java
Label 2 corresponds to javascript
Label 3 corresponds to python


In [9]:
# Validation dataset
raw_val_ds = preprocessing.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.


In [10]:
# Test dataset
test_dir = dataset_dir/'test'
raw_test_ds = preprocessing.text_dataset_from_directory(
    test_dir, batch_size=batch_size)

Found 8000 files belonging to 4 classes.


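Prepare the dataset for training with the following steps:

- Standardization: simplify the text, for example by removing punctuation or HTML tags
- Tokenization: split the text into smaller tokens such as words
- Vectorization: convert the tokens into numbers so they can be fed to a neural network

All of these can be done with the `TextVectorization` layer. Its default settings are:

- Standardization: lowercase the text and strip punctuation
- Tokenization: split on whitespace
- Vectorization: convert each token to an integer index (`int` mode, useful when word order matters)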
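
The three steps can be sketched in plain Python as follows. This is only a conceptual illustration under stated assumptions, not the layer's implementation: the function names are made up for this example, punctuation is limited to ASCII, and index `0` is reserved for padding and index `1` for unknown tokens.

```python
import re
import string
from collections import Counter

PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def standardize(text):
    """Lowercase the text and strip ASCII punctuation."""
    return PUNCTUATION.sub("", text.lower())


def tokenize(text):
    """Standardize, then split on whitespace."""
    return standardize(text).split()


def build_vocabulary(corpus, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved (padding, unknown)."""
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    # Break frequency ties alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]


def vectorize(text, vocabulary):
    """Map each token to its vocabulary index, or to 1 if it is unknown."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text)]


corpus = ["How do I sort a list?", "Sort a dict by value!"]
vocabulary = build_vocabulary(corpus, max_tokens=10)
print(vocabulary)
print(vectorize("sort a tuple", vocabulary))  # [3, 2, 1]
```

The token `tuple` never appears in the corpus, so it falls back to the unknown-token index.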

Two models are built here:

1. A "bag-of-words" model, with vectorization in `binary` mode
2. A one-dimensional CNN, with vectorization in `int` mode

In [11]:
VOCAB_SIZE = 10000

binary_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary'
)

In [12]:
MAX_SEQUENCE_LENGTH = 250

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    # Explicit maximum sequence length will cause the layer
    # to pad or truncate sequences to exactly `sequence_length` values.
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

Call the `adapt` method to fit each layer's vocabulary to the dataset.

In [13]:
# Make a text-only dataset (without labels), then call adapt
train_text = raw_train_ds.map(lambda text, labels: text)
binary_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)

Look at the output of these layers.

In [14]:
def binary_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return binary_vectorize_layer(text), label

def int_vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return int_vectorize_layer(text), label

# Retrieve a batch (of 32 questions and labels) from the dataset
text_batch, label_batch = next(iter(raw_train_ds))
first_question, first_label = text_batch[0], label_batch[0]
print('Question:', first_question)
print('Label:', first_label)

Question: tf.Tensor(b'"function expected error in blank for dynamically created check box when it is clicked i want to grab the attribute value.it is working in ie 8,9,10 but not working in ie 11,chrome shows function expected error..&lt;input type=checkbox checked=\'checked\' id=\'symptomfailurecodeid\' tabindex=\'54\' style=\'cursor:pointer;\' onclick=chkclickevt(this);  failurecodeid=""1"" &gt;...function chkclickevt(obj) { .    alert(obj.attributes(""failurecodeid""));.}"\n', shape=(), dtype=string)
Label: tf.Tensor(2, shape=(), dtype=int32)


In [15]:
print("'binary' vectorized question:",
      binary_vectorize_text(first_question, first_label)[0])

'binary' vectorized question: tf.Tensor([[1. 1. 1. ... 0. 0. 0.]], shape=(1, 10000), dtype=float32)


In [16]:
print("'int' vectorized question:",
      int_vectorize_text(first_question, first_label)[0])

'int' vectorized question: tf.Tensor(
[[  38  450   65    7   16   12  892  265  186  451   44   11    6  685
     3   46    4 2062    2  485    1    6  158    7  479    1   26   20
   158    7  479    1  502   38  450    1 1767 1763    1    1    1    1
     1    1    1    1    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0

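In `binary` mode the layer returns an array of 0s and 1s indicating whether each vocabulary token is present in the input. In `int` mode it returns an array of integers, each one the vocabulary index of a token. The `.get_vocabulary()` method shows which token each index corresponds to.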
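
The difference between the two modes can be sketched in plain Python. The index layout used here (`0` for padding, `1` for out-of-vocabulary tokens, in both functions) is a simplifying assumption for illustration, and the real layer may lay out its `binary` vocabulary differently; the toy `index` dictionary is also hypothetical.

```python
PAD, OOV = 0, 1


def to_int_sequence(tokens, index, sequence_length):
    """Token ids in order, truncated or zero-padded to a fixed length."""
    ids = [index.get(tok, OOV) for tok in tokens][:sequence_length]
    return ids + [PAD] * (sequence_length - len(ids))


def to_multi_hot(tokens, index, vocab_size):
    """One slot per vocabulary entry; 1.0 if the token occurs at least once."""
    vector = [0.0] * vocab_size
    for tok in tokens:
        vector[index.get(tok, OOV)] = 1.0
    return vector


index = {"sort": 2, "a": 3, "list": 4}
tokens = ["sort", "a", "sort", "tuple"]

print(to_int_sequence(tokens, index, sequence_length=6))  # [2, 3, 2, 1, 0, 0]
print(to_multi_hot(tokens, index, vocab_size=5))          # [0.0, 1.0, 1.0, 1.0, 0.0]
```

The integer sequence keeps word order and repetition, while the multi-hot vector discards both, which is why the former suits a convolutional model and the latter a bag-of-words model.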

In [17]:
print('1289 ---> ', int_vectorize_layer.get_vocabulary()[1289])
print('313  ---> ', int_vectorize_layer.get_vocabulary()[313])
print('Vocabulary size:', len(int_vectorize_layer.get_vocabulary()))

1289 --->  roman
313  --->  source
Vocabulary size: 10000


Apply the layers to each dataset in preparation for training.

In [18]:
binary_train_ds = raw_train_ds.map(binary_vectorize_text)
binary_val_ds = raw_val_ds.map(binary_vectorize_text)
binary_test_ds = raw_test_ds.map(binary_vectorize_text)

int_train_ds = raw_train_ds.map(int_vectorize_text)
int_val_ds = raw_val_ds.map(int_vectorize_text)
int_test_ds = raw_test_ds.map(int_vectorize_text)

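Next, configure the datasets as follows for better performance.

- `.cache()`: keeps the data in memory after it has been loaded from disk, so that loading does not become a bottleneck during training. For a dataset too large to fit in memory, it can instead write an on-disk cache, which is still more efficient to read than many small files.
- `.prefetch()`: overlaps data preprocessing with model execution during training.

For details, see the [data performance guide](https://www.tensorflow.org/guide/data_performance).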
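
The idea behind caching can be sketched without TensorFlow: the first pass over the data pays the loading cost, and later passes replay the stored elements. The `CachedDataset` class and `slow_source` function below are hypothetical stand-ins written for this illustration, not part of `tf.data`, and prefetching is not modelled.

```python
class CachedDataset:
    """Iterable that loads elements once and replays them from memory afterwards."""

    def __init__(self, make_source):
        self._make_source = make_source  # zero-argument callable returning an iterable
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache
            return
        loaded = []
        for item in self._make_source():
            loaded.append(item)
            yield item
        # Keep the cache only after a full pass has completed.
        self._cache = loaded


stats = {"loads": 0}


def slow_source():
    """Stand-in for reading and decoding files from disk."""
    for i in range(3):
        stats["loads"] += 1
        yield i * i


dataset = CachedDataset(slow_source)
first_epoch = list(dataset)
second_epoch = list(dataset)
print(first_epoch, second_epoch, stats["loads"])
```

After two passes the source has still been read only three times, once per element.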

In [19]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(dataset):
    return dataset.cache().prefetch(buffer_size=AUTOTUNE)

In [20]:
binary_train_ds = configure_dataset(binary_train_ds)
binary_val_ds = configure_dataset(binary_val_ds)
binary_test_ds = configure_dataset(binary_test_ds)

int_train_ds = configure_dataset(int_train_ds)
int_val_ds = configure_dataset(int_val_ds)
int_test_ds = configure_dataset(int_test_ds)

Train the models.

In [21]:
# For the `binary` vectorized data, train a simple bag-of-words linear model
binary_model = tf.keras.Sequential([layers.Dense(4)])
binary_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)
history = binary_model.fit(
    binary_train_ds, validation_data=binary_val_ds, epochs=10
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [22]:
# For the `int` vectorized data, train a 1D ConvNet
def create_model(vocab_size, num_labels):
    model = tf.keras.Sequential([
        layers.Embedding(vocab_size, 64, mask_zero=True),
        layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_labels)
    ])
    return model

# vocab_size is VOCAB_SIZE + 1 since 0 is used additionally for padding.
int_model = create_model(vocab_size=VOCAB_SIZE+1, num_labels=4)
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)
history = int_model.fit(
    int_train_ds, validation_data=int_val_ds, epochs=5
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Compare the two models.

In [23]:
print("Linear model on binary vectorized data:")
print(binary_model.summary())

Linear model on binary vectorized data:
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 4)                 40004     
Total params: 40,004
Trainable params: 40,004
Non-trainable params: 0
_________________________________________________________________
None


In [24]:
print("ConvNet model on int vectorized data:")
print(int_model.summary())

ConvNet model on int vectorized data:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          640064    
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d (Global (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 260       
Total params: 660,868
Trainable params: 660,868
Non-trainable params: 0
_________________________________________________________________
None


In [25]:
binary_loss, binary_accuracy = binary_model.evaluate(binary_test_ds)
int_loss, int_accuracy = int_model.evaluate(int_test_ds)

print("Binary model accuracy: {:2.2%}".format(binary_accuracy))
print("Int model accuracy:    {:2.2%}".format(int_accuracy))

Binary model accuracy: 81.61%
Int model accuracy:    81.11%


Include the `TextVectorization` layer in the model so that it can be saved and loaded easily on other devices. This is done by creating a new model that reuses the trained weights.

In [26]:
export_model = tf.keras.Sequential([
    binary_vectorize_layer,
    binary_model,
    layers.Activation('sigmoid')
])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy']
)

# Test it with `raw_test_ds`, which yields raw strings
loss, accuracy = export_model.evaluate(raw_test_ds)
print("Accuracy: {:2.2%}".format(accuracy))

Accuracy: 81.61%


`export_model` now accepts raw strings as input, and its `model.predict` method returns a score for each label. The following function returns the label with the highest score:

In [27]:
def get_string_labels(predicted_scores_batch):
    predicted_int_labels = tf.argmax(predicted_scores_batch, axis=1)
    predicted_labels = tf.gather(raw_train_ds.class_names, predicted_int_labels)
    return predicted_labels

In [28]:
inputs = [
    "how do I extract keys from a dict into a list?",  # python
    "debug public static void main(string[] args) {...}",  # java
]
predicted_scores = export_model.predict(inputs)
predicted_labels = get_string_labels(predicted_scores)
for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label.numpy())

Question:  how do I extract keys from a dict into a list?
Predicted label:  b'python'
Question:  debug public static void main(string[] args) {...}
Predicted label:  b'java'


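Training the network first and inserting the `TextVectorization` layer afterwards keeps preprocessing outside the model during training, so it can run asynchronously on the CPU while the model trains on the GPU. For details on saving models, see the [tutorial about saving models](https://www.tensorflow.org/tutorials/keras/save_and_load).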

## Example 2

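Use `tf.data.TextLineDataset` to load examples from text files and `tf.text` to preprocess the data. This example uses three different English translations of the same work, Homer's Iliad, and builds a model that predicts the translator from a single line of text. The three translators are: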

- William Cowper
- Edward, Earl of Derby
- Samuel Butler

The text files used here have already had some basic preprocessing applied (removal of headers and footers, line numbers, and chapter titles). Download them.

In [29]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = pathlib.Path(text_dir).parent
list(parent_dir.iterdir())

[PosixPath('/Users/nakayamayasuaki/.keras/datasets/cowper.txt'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/imdb_word_index.json'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/flower_photos.tar.gz'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/mnist.npz'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/heart.csv'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/fashion-mnist'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/butler.txt'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/HIGGS.csv.gz'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/imdb.npz'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/derby.txt'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/train.csv'),
 PosixPath('/Users/nakayamayasuaki/.keras/datasets/flower_photos')]

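Use `TextLineDataset` to load the data. It creates a `tf.data.Dataset` from text files in which each line is one example (by contrast, `text_dataset_from_directory` treats the whole content of a file as a single example).

Load each downloaded file into its own dataset. Every line needs a label, so use `tf.data.Dataset.map` to apply a labeling function to each example.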
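
The line-per-example idea can be sketched in plain Python: each file contributes one example per line, and every line carries the index of the file it came from. The `labeled_lines` helper and the two temporary files are assumptions made for this illustration only.

```python
import pathlib
import tempfile


def labeled_lines(paths):
    """Yield (line, label) pairs, where label is the position of the file in `paths`."""
    for label, path in enumerate(paths):
        with open(path, encoding="utf-8") as f:
            for line in f:
                # Drop the trailing newline so each example is just the text.
                yield line.rstrip("\n"), label


# Two tiny stand-in files (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp:
    first = pathlib.Path(tmp) / "first.txt"
    second = pathlib.Path(tmp) / "second.txt"
    first.write_text("line a1\nline a2\n", encoding="utf-8")
    second.write_text("line b1\n", encoding="utf-8")
    line_examples = list(labeled_lines([first, second]))

print(line_examples)
```

Because the label depends only on the file, all lines from the same translator share one label.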

In [30]:
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(str(parent_dir/file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

In [31]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
VALIDATION_SIZE = 5000

In [32]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)
    
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False
)

Print a few examples. The dataset has not been batched yet, so each element corresponds to a single data point.

In [33]:
for text, label in all_labeled_data.take(10):
    print("Sentence: ", text.numpy())
    print("Label:    ", label.numpy())

Sentence:  b'Whom I may slaughter; and no want of Greeks'
Label:     0
Sentence:  b'Nor ample shields they bore, nor ashen spear;'
Label:     1
Sentence:  b"The dark-hair'd monarch spoke; and led the way"
Label:     1
Sentence:  b"His peril imminent, snapp'd short the brace"
Label:     0
Sentence:  b"Were giv'n by Agamemnon, King of men,"
Label:     1
Sentence:  b"They waited not, but, fir'd with equal rage,"
Label:     1
Sentence:  b'Into the stream, and, as he floated down,'
Label:     0
Sentence:  b"He rais'd, and thus with gentle words address'd:"
Label:     1
Sentence:  b'Of Jove advanced to honor and renown!'
Label:     0
Sentence:  b"To pay their fun'ral rites; for Saturn's son"
Label:     1


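Prepare the dataset for training. Here the `tf.text` API is used for standardization and tokenization, and a `StaticVocabularyTable` maps tokens to integers. Tokenization uses `UnicodeScriptTokenizer`.

First, define a function that lowercases the text and tokenizes it, then apply it to the dataset with `tf.data.Dataset.map`.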
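
For intuition, the lowercase-and-tokenize step can be approximated with a regular expression. This is a rough ASCII-only stand-in written for this illustration, not the `UnicodeScriptTokenizer` algorithm: it simply alternates between runs of lowercase letters and runs of other non-space characters, and `str.lower` stands in for Unicode case folding.

```python
import re

# Runs of ASCII letters, or runs of anything that is neither a letter nor whitespace.
TOKEN_PATTERN = re.compile(r"[a-z]+|[^a-z\s]+")


def simple_tokenize(text):
    """Lowercase the text and split it into letter runs and punctuation runs."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("The dark-hair'd monarch spoke; and led the way")
print(tokens)
```

On this line the sketch splits `dark-hair'd` into `dark`, `-`, `hair`, `'`, `d`, so hyphens and apostrophes become tokens of their own.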

In [34]:
tokenizer = tf_text.UnicodeScriptTokenizer()

def tokenize(text, unused_label):
    lower_case = tf_text.case_fold_utf8(text)
    return tokenizer.tokenize(lower_case)

tokenized_ds = all_labeled_data.map(tokenize)



In [35]:
# print out a few tokenized examples
for text_batch in tokenized_ds.take(5):
    print("Tokens: ", text_batch.numpy())

Tokens:  [b'whom' b'i' b'may' b'slaughter' b';' b'and' b'no' b'want' b'of'
 b'greeks']
Tokens:  [b'nor' b'ample' b'shields' b'they' b'bore' b',' b'nor' b'ashen' b'spear'
 b';']
Tokens:  [b'the' b'dark' b'-' b'hair' b"'" b'd' b'monarch' b'spoke' b';' b'and'
 b'led' b'the' b'way']
Tokens:  [b'his' b'peril' b'imminent' b',' b'snapp' b"'" b'd' b'short' b'the'
 b'brace']
Tokens:  [b'were' b'giv' b"'" b'n' b'by' b'agamemnon' b',' b'king' b'of' b'men'
 b',']


Build the vocabulary by sorting the tokens by frequency and keeping the most frequent ones.

In [36]:
tokenized_ds = configure_dataset(tokenized_ds)

vocab_dict = collections.defaultdict(lambda: 0)
for toks in tokenized_ds.as_numpy_iterator():
    for tok in toks:
        vocab_dict[tok] += 1
        
vocab = sorted(vocab_dict.items(), key=lambda x: x[1], reverse=True)
vocab = [token for token, count in vocab]
vocab = vocab[:VOCAB_SIZE]
vocab_size = len(vocab)
print("Vocab size: ", vocab_size)
print("First five vocab entries: ", vocab[:5])

Vocab size:  10000
First five vocab entries:  [b',', b'the', b'and', b"'", b'of']


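To convert tokens to integers, create a `StaticVocabularyTable` from `vocab`. Vocabulary tokens are mapped to the integers `2` through `vocab_size + 1`. As with the `TextVectorization` layer, `0` is reserved for padding and `1` for out-of-vocabulary (OOV) tokens.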
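
The index convention can be illustrated with an ordinary dictionary. This sketch only follows the numbering described above (`0` for padding, `1` for OOV, vocabulary tokens from `2`); it does not reproduce how `StaticVocabularyTable` assigns OOV buckets, and the helper names and the five-token vocabulary are made up for the example.

```python
def build_lookup(vocab, first_index=2):
    """Map each vocabulary token to an integer, leaving 0 and 1 unused."""
    return {token: i for i, token in enumerate(vocab, start=first_index)}


def lookup(tokens, table, oov_index=1):
    """Convert tokens to integers, sending unknown tokens to the OOV index."""
    return [table.get(tok, oov_index) for tok in tokens]


toy_vocab = [",", "the", "and", "'", "of"]
table = build_lookup(toy_vocab)
ids = lookup(["the", "wrath", "of", "achilles", ","], table)
print(ids)  # [3, 1, 6, 1, 2]
```

With five vocabulary tokens the assigned integers run from 2 to 6, that is, up to `vocab_size + 1`.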

In [37]:
keys = vocab
values = range(2, len(vocab) + 2) # reserve 0 for padding, 1 for OOV

init = tf.lookup.KeyValueTensorInitializer(
    keys, values, key_dtype=tf.string, value_dtype=tf.int64
)

num_oov_buckets = 1
vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

In [38]:
def preprocess_text(text, label):
    standardized = tf_text.case_fold_utf8(text)
    tokenized = tokenizer.tokenize(standardized)
    vectorized = vocab_table.lookup(tokenized)
    return vectorized, label

In [39]:
example_text, example_label = next(iter(all_labeled_data))
print("Sentence: ", example_text.numpy())
vectorized_text, example_label = preprocess_text(example_text, example_label)
print("Vectorized sentence: ", vectorized_text.numpy())

Sentence:  b'Whom I may slaughter; and no want of Greeks'
Vectorized sentence:  [  65   21   78  585   10    4   76 1120    6   89]


With preprocessing defined, apply it to the dataset.

In [40]:
all_encoded_data = all_labeled_data.map(preprocess_text)

Split the data into training and validation sets.
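
A minimal pure-Python sketch of the split and of padded batching follows. The helper names and the toy sequences are illustrative assumptions; the point is only that the first examples go to validation, the rest to training, and each batch is padded with zeros to the length of its longest sequence.

```python
def split_validation(examples, validation_size):
    """First `validation_size` examples become validation data, the rest training data."""
    return examples[validation_size:], examples[:validation_size]


def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its own longest sequence."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        width = max(len(seq) for seq in batch)
        yield [seq + [pad_value] * (width - len(seq)) for seq in batch]


encoded = [[65, 21, 78], [4, 76], [6], [89, 10, 4, 76]]
train, validation = split_validation(encoded, validation_size=1)
batches = list(padded_batches(train, batch_size=2))

print(validation)  # [[65, 21, 78]]
print(batches)     # [[[4, 76], [6, 0]], [[89, 10, 4, 76]]]
```

Padding per batch rather than to a global maximum keeps short batches small, which is why different batches can have different widths.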

In [41]:
train_data = all_encoded_data.skip(VALIDATION_SIZE).shuffle(BUFFER_SIZE)
validation_data = all_encoded_data.take(VALIDATION_SIZE)

train_data = train_data.padded_batch(BATCH_SIZE)
validation_data = validation_data.padded_batch(BATCH_SIZE)

sample_text, sample_labels = next(iter(validation_data))
print("Text batch shape: ", sample_text.shape)
print("Label batch shape: ", sample_labels.shape)
print("First text example: ", sample_text[0])
print("First label example: ", sample_labels[0])

Text batch shape:  (64, 17)
Label batch shape:  (64,)
First text example:  tf.Tensor(
[  65   21   78  585   10    4   76 1120    6   89    0    0    0    0
    0    0    0], shape=(17,), dtype=int64)
First label example:  tf.Tensor(0, shape=(), dtype=int64)


In [42]:
# Since we use 0 for padding and 1 for out-of-vocabulary (OOV) tokens,
# the vocabulary size has increased by two.
vocab_size += 2

# Configure the datasets for better performance as before.
train_data = configure_dataset(train_data)
validation_data = configure_dataset(validation_data)

Build and train the model.

In [43]:
model = create_model(vocab_size=vocab_size, num_labels=3)
model.compile(
    optimizer='adam',
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
history = model.fit(train_data, validation_data=validation_data, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [44]:
loss, accuracy = model.evaluate(validation_data)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.3881083130836487
Accuracy: 84.92%


Finally, modify the model so that it can process raw strings.

In [46]:
preprocess_layer = TextVectorization(
    max_tokens=vocab_size,
    standardize=tf_text.case_fold_utf8,
    split=tokenizer.tokenize,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)
preprocess_layer.set_vocabulary(vocab)

In [47]:
export_model = tf.keras.Sequential([
    preprocess_layer,
    model,
    layers.Activation('sigmoid')
])

export_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=False),
    optimizer='adam',
    metrics=['accuracy']
)

In [48]:
# Create a test dataset of raw strings
test_ds = all_labeled_data.take(VALIDATION_SIZE).batch(BATCH_SIZE)
test_ds = configure_dataset(test_ds)
loss, accuracy = export_model.evaluate(test_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.568994402885437
Accuracy: 78.58%


In [49]:
inputs = [
    "Join'd to th' Ionians with their flowing robes,",  # Label: 1
    "the allies, and his armour flashed about him so that he seemed to all",  # Label: 2
    "And with loud clangor of his arms he fell.",  # Label: 0
]
predicted_scores = export_model.predict(inputs)
predicted_labels = tf.argmax(predicted_scores, axis=1)
for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label.numpy())

Question:  Join'd to th' Ionians with their flowing robes,
Predicted label:  1
Question:  the allies, and his armour flashed about him so that he seemed to all
Predicted label:  2
Question:  And with loud clangor of his arms he fell.
Predicted label:  0


## Datasets from TensorFlow Datasets (TFDS)

In [51]:
# IMDB Large Movie Review dataset
train_ds = tfds.load(
    'imdb_reviews',
    split='train',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True
)

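# NOTE: this reloads the 'train' split, so the validation metrics reported below
# are computed on training data; load split='test' for a held-out evaluation.
val_ds = tfds.load(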
    'imdb_reviews',
    split='train',
    batch_size=BATCH_SIZE,
    shuffle_files=True,
    as_supervised=True
)

for review_batch, label_batch in val_ds.take(1):
    for i in range(5):
        print("Review: ", review_batch[i].numpy())
        print("Label:  ", label_batch[i].numpy())

Review:  b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
Label:   0
Review:  b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubb

Preprocess the data.

In [52]:
vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH
)

# Make a text-only dataset (without labels), then call adapt
train_text = train_ds.map(lambda text, labels: text)
vectorize_layer.adapt(train_text)

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)

# Configure datasets for performance as before
train_ds = configure_dataset(train_ds)
val_ds = configure_dataset(val_ds)

Train the model.

In [53]:
model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=1)
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 64)          640064    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, None, 64)          20544     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 65        
Total params: 660,673
Trainable params: 660,673
Non-trainable params: 0
_________________________________________________________________


In [54]:
model.compile(
    loss=losses.BinaryCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy']
)

history = model.fit(train_ds, validation_data=val_ds, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


In [55]:
loss, accuracy = model.evaluate(val_ds)

print("Loss: ", loss)
print("Accuracy: {:2.2%}".format(accuracy))

Loss:  0.09198682010173798
Accuracy: 97.90%


Export the model.

In [56]:
export_model = tf.keras.Sequential([
    vectorize_layer,
    model,
    layers.Activation('sigmoid')
])

export_model.compile(
    loss=losses.BinaryCrossentropy(from_logits=False),  # single sigmoid output, binary labels
    optimizer='adam',
    metrics=['accuracy'])

# 0 --> negative review
# 1 --> positive review
inputs = [
    "This is a fantastic movie.",
    "This is a bad movie.",
    "This movie was so bad that it was good.",
    "I will never say yes to watching this movie.",
]
predicted_scores = export_model.predict(inputs)
predicted_labels = [int(round(x[0])) for x in predicted_scores]
for input, label in zip(inputs, predicted_labels):
    print("Question: ", input)
    print("Predicted label: ", label)

Question:  This is a fantastic movie.
Predicted label:  1
Question:  This is a bad movie.
Predicted label:  0
Question:  This movie was so bad that it was good.
Predicted label:  0
Question:  I will never say yes to watching this movie.
Predicted label:  0
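
The final step, turning a single raw model output into a 0/1 label, can be sketched without TensorFlow. The logit values below are made up for illustration, and the helper uses a strict threshold comparison rather than the `round` call in the cell above, so behaviour exactly at 0.5 may differ.

```python
import math


def sigmoid(logit):
    """Squash a real-valued logit into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def to_label(logit, threshold=0.5):
    """Return 1 (positive review) if the sigmoid score exceeds the threshold, else 0."""
    return int(sigmoid(logit) > threshold)


for logit in (-2.0, 0.3, 4.0):
    print(logit, round(sigmoid(logit), 3), to_label(logit))
```

A positive logit always maps to a score above 0.5 and therefore to label 1, so the sign of the logit alone decides the predicted class.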
