## Text Processing and Word Embedding

This notebook analyzes the IMDB movie review dataset.

In [96]:
import io
import re
import string
import pathlib
import tensorflow as tf

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.layers import TextVectorization

In [97]:
imdb_path = pathlib.Path('./../../../dataset/imdb/aclImdb')
train_path = imdb_path / 'train'
test_path = imdb_path / 'test'

In [98]:
list(train_path.iterdir())

[PosixPath('../../../dataset/imdb/aclImdb/train/.DS_Store'),
 PosixPath('../../../dataset/imdb/aclImdb/train/neg'),
 PosixPath('../../../dataset/imdb/aclImdb/train/urls_pos.txt'),
 PosixPath('../../../dataset/imdb/aclImdb/train/urls_neg.txt'),
 PosixPath('../../../dataset/imdb/aclImdb/train/pos')]

In [99]:
batch_size = 512
seed = 33
train_ds, val_ds = tf.keras.utils.text_dataset_from_directory(train_path,
                                          shuffle=True,
                                          seed=seed,
                                          validation_split=0.2,
                                          subset='both')

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Using 5000 files for validation.


In [100]:
for text_batch, label_batch in train_ds.take(1):
    for i in range(3):
        print(label_batch[i].numpy(), text_batch.numpy()[i])

1 b'This is one military drama I like a lot! Tom Berenger playing military assassin Thomas Beckett. This Marine is no-nonsense, in your face, and no questions asked kind of person who gets the job done. There you have Billy Zane("The Phantom" and others) who plays Richard Miller, a former SWAT form D.C., works for the government and takes orders only from them. Who needs a bureaucrat? I don\'t! When these two are paired, sparks should be flying. And how. However, Beckett teaches the young bureaucrat on how it works. When the other sniper hits, it\'s wits vs. wits, cat vs. mouse, gunman vs. gunman. And when the seasoned sniper is caught, it\'s up to Miller to put politics aside and save him. Who needs politics when you a pro like Beckett, he took orders from no one but himself, plays by the rules and not the book, and mutual respect is brought out despite the politics. The movie was a direct hit. Watch it. Rating 4 out of 5 stars.'
0 b"A truly frightening film. Feels as if it were made 

### [Text Preprocessing](https://www.tensorflow.org/text/guide/word_embeddings#text_preprocessing)

The text data needs to be preprocessed first and then vectorized.


In [101]:
# Custom standardization: lowercase, replace HTML '<br />' tags with a space,
# and remove punctuation.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped,
                                    '[%s]' % re.escape(string.punctuation), '')
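
To see what the standardization step does to a review, here is a minimal plain-Python sketch of the same three steps. The function name `standardize_py` and the sample string are made up for illustration. Python's `str.lower` handles Unicode, whereas `tf.strings.lower` lowercases only ASCII by default, so the two can differ on non-ASCII text.

```python
import re
import string

def standardize_py(text: str) -> str:
    """Plain-Python mirror of custom_standardization (illustrative only)."""
    lowercase = text.lower()
    # Replace the HTML line break with a space so adjacent words stay separated.
    stripped = lowercase.replace('<br />', ' ')
    # Same character class as the notebook: every ASCII punctuation mark.
    return re.sub('[%s]' % re.escape(string.punctuation), '', stripped)

sample = "Great movie!<br />Don't miss it."
print(standardize_py(sample))  # great movie dont miss it
```

Apostrophes are removed along with the other punctuation, so `don't` becomes the single token `dont`.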


In [102]:
# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings
# to integers. Set output_sequence_length because the samples do not all
# have the same length.
vectorization_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length
)

In [103]:
# Make a text-only dataset (no labels) and call adapt to build the vocab
text_ds = train_ds.map(lambda x, y: x)
vectorization_layer.adapt(text_ds)
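
The idea behind `adapt` and `output_mode='int'` can be sketched in plain Python on a toy corpus. This is a simplified illustration, not the Keras implementation: `build_vocab`, `vectorize`, and the corpus are hypothetical, and the alphabetical tie-break is an assumption made only to keep the toy output deterministic. What it does mirror is that index 0 is reserved for padding, index 1 for out-of-vocabulary tokens, and more frequent tokens get smaller indices.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; index 0 is padding, index 1 is out-of-vocabulary."""
    counts = Counter(token for text in texts for token in text.split())
    # Sort by descending count, then alphabetically (toy tie-break, an assumption).
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return ['', '[UNK]'] + ranked[:max_tokens - 2]

def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, then pad with zeros."""
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["good film", "bad film", "good good plot"]
toy_vocab = build_vocab(corpus, max_tokens=5)
print(toy_vocab)                                  # ['', '[UNK]', 'good', 'film', 'bad']
print(vectorize("good plot film", toy_vocab, 5))  # [2, 1, 3, 0, 0]
```

Here `plot` falls outside the five-token vocabulary and maps to 1, and the three-token input is padded with zeros up to the fixed length.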

### Create A Classification Model


In [104]:
embedding_dim = 16

model = Sequential()

model.add(vectorization_layer)
model.add(Embedding(vocab_size, embedding_dim, name="embedding"))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(1))

In [105]:
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
model.compile(optimizer="RMSprop",
             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
             metrics=['accuracy'])
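
Because the final `Dense(1)` layer has no activation, the model outputs raw logits, and `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below shows, for a single example, that cross-entropy computed from a logit matches cross-entropy computed from the corresponding probability. The helper names are hypothetical and the stable formula is the standard textbook form, not necessarily the exact code path Keras uses.

```python
import math

def sigmoid(logit: float) -> float:
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_logit(label: float, logit: float) -> float:
    """Binary cross-entropy from a raw logit, written to avoid overflow in exp."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(label: float, prob: float) -> float:
    """Textbook binary cross-entropy from a probability in (0, 1)."""
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

logit, label = 2.0, 1.0
print(sigmoid(logit))                               # probability of the positive class
print(bce_from_logit(label, logit))                 # loss from the logit
print(bce_from_probability(label, sigmoid(logit)))  # same loss via the probability
```

To turn the model's predictions into probabilities at inference time, apply a sigmoid to the output of `model.predict`.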

In [106]:
model.fit(train_ds,
         validation_data=val_ds,
         epochs=10,
         callbacks=[tb_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x161736f70>

In [87]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 text_vectorization_2 (TextV  (None, 100)              0         
 ectorization)                                                   
                                                                 
 embedding (Embedding)       (None, 100, 16)           160000    
                                                                 
 flatten_2 (Flatten)         (None, 1600)              0         
                                                                 
 dense_4 (Dense)             (None, 16)                25616     
                                                                 
 dense_5 (Dense)             (None, 1)                 17        
                                                                 
Total params: 185,633
Trainable params: 185,633
Non-trainable params: 0
________________________________________________
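
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the hyperparameters used above; `hidden_units` is just a local name for the width of the first `Dense` layer.

```python
vocab_size = 10000
embedding_dim = 16
sequence_length = 100
hidden_units = 16  # width of the first Dense layer

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# Flatten turns (100, 16) into a vector of length 1600 and has no parameters.
flattened_size = sequence_length * embedding_dim
# Dense layers: one weight per input-output pair plus one bias per output.
dense_hidden_params = flattened_size * hidden_units + hidden_units
dense_output_params = hidden_units * 1 + 1

total = embedding_params + dense_hidden_params + dense_output_params
print(embedding_params, dense_hidden_params, dense_output_params, total)
# 160000 25616 17 185633
```

Most of the parameters sit in the embedding table, so the model size grows roughly linearly with `vocab_size`.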

In [88]:
%load_ext tensorboard
%tensorboard --logdir logs

### Retrieve the trained word embedding and save them to disk

The embedding learned for each word in the vocabulary can be saved to local disk. Once saved, the files can be uploaded to the Embedding Projector for inspection.

In [93]:
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorization_layer.get_vocabulary()

In [107]:
# Write one embedding vector per line to vectors.tsv and the matching word
# to metadata.tsv; the context manager closes both files on exit.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip index 0, it's padding
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")
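
The saved files can also be inspected locally. The sketch below reads a `vectors.tsv` and `metadata.tsv` pair back and ranks words by cosine similarity to a query word. So that it runs on its own, it writes two tiny stand-in files to a temporary directory. The helper names, the toy words, and their 2-dimensional vectors are all made up for illustration and are not results from the trained model.

```python
import math
import os
import tempfile

def load_embeddings(vectors_path, metadata_path):
    """Pair each word in metadata.tsv with the vector on the same line of vectors.tsv."""
    with open(vectors_path, encoding='utf-8') as vec_file, \
         open(metadata_path, encoding='utf-8') as meta_file:
        return {
            word.rstrip('\n'): [float(x) for x in line.split('\t')]
            for word, line in zip(meta_file, vec_file)
        }

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy stand-ins for the real files, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
vec_path = os.path.join(tmp_dir, 'vectors.tsv')
meta_path = os.path.join(tmp_dir, 'metadata.tsv')
with open(vec_path, 'w', encoding='utf-8') as f:
    f.write('1.0\t0.0\n0.9\t0.1\n-1.0\t0.0\n')
with open(meta_path, 'w', encoding='utf-8') as f:
    f.write('good\ngreat\nbad\n')

embeddings = load_embeddings(vec_path, meta_path)
query = 'good'
neighbors = sorted(
    (w for w in embeddings if w != query),
    key=lambda w: cosine(embeddings[query], embeddings[w]),
    reverse=True,
)
print(neighbors)  # ['great', 'bad']
```

Pointing `vec_path` and `meta_path` at the files written by the notebook would give the same kind of ranking for the trained embedding.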

In [109]:
test_ds = tf.keras.utils.text_dataset_from_directory(test_path)

Found 25000 files belonging to 2 classes.


In [110]:
model.evaluate(test_ds)



[1.2282118797302246, 0.7752000093460083]