<br>

<div align=center><font color=maroon size=6 style="line-height:40px;"><b>Visualizing Data using the Embedding Projector<br> in TensorBoard</b></font></div>

<br>

<font size=4><b>References:</b></font>
1. TensorFlow > <a href="https://www.tensorflow.org/resources" style="text-decoration:none;">Resources</a> 
    * `TensorFlow > Resources > TensorBoard > Guide > `<a href="https://www.tensorflow.org/tensorboard/tensorboard_projector_plugin" style="text-decoration:none;">Visualizing Data using the Embedding Projector in TensorBoard</a>
        * Run in <a href="https://colab.research.google.com/github/tensorflow/tensorboard/blob/master/docs/tensorboard_projector_plugin.ipynb" style="text-decoration:none;">Google Colab</a>

<br>
<br>
<br>

## Overview

Using the **TensorBoard Embedding Projector**, you can graphically represent high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers.

<!-- <img src="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/embedding.jpg?raw=\" alt="Screenshot of the embedding projector" width="400"/> -->

In this tutorial, you will learn how to visualize this type of trained layer.

<br>
<br>
<br>

## Setup

For this tutorial, we will be using TensorBoard to visualize an embedding layer generated for classifying movie review data.

In [1]:
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

%load_ext tensorboard

In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorboard.plugins import projector

import os



In [3]:
print(tf.__version__)

2.5.0


<br>
<br>
<br>

## IMDB Data 

We will be using a dataset of 25,000 IMDB movie reviews, each with a sentiment label (positive or negative). Each review is preprocessed and encoded as a sequence of word indices (integers). <font style="color:maroon;font-size:120%">**For simplicity, words are indexed by overall frequency in the dataset; for instance, the integer "3" encodes the 3rd most frequent word across all reviews**. This allows for quick filtering operations such as:</font> "only consider the top 10,000 most common words, but eliminate the top 20 most common words".

<font style="color:maroon;font-size:110%">As a convention, **"0"** does not stand for any specific word, but instead is used to encode any unknown word</font>. Later in the tutorial, we will remove the row for "0" in the visualization.
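
The indexing convention described above can be sketched in plain Python. This is a minimal illustration on a made-up three-review corpus, not the encoder loaded in this tutorial (which works on subwords); the helper `encode` and its `num_words` and `skip_top` parameters are hypothetical names chosen for the example.

```python
from collections import Counter

# Toy corpus standing in for the IMDB reviews (hypothetical data).
reviews = [
    "the movie was great",
    "the plot was thin but the cast was great",
    "a great cast",
]

# Rank words by overall frequency; index 0 is reserved for unknown words.
counts = Counter(word for review in reviews for word in review.split())
ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(review, num_words=None, skip_top=0):
    """Map words to frequency ranks; unknown or filtered-out words become 0."""
    encoded = []
    for word in review.split():
        idx = word_index.get(word, 0)
        if idx <= skip_top or (num_words is not None and idx > num_words):
            idx = 0
        encoded.append(idx)
    return encoded


full = encode("the cast was great")
# Keep only ranks 2..4: drop the single most frequent word and everything past rank 4.
filtered = encode("the cast was great", num_words=4, skip_top=1)
print(full)
print(filtered)
```

Because the index is the frequency rank, both filters reduce to simple integer comparisons, which is what makes this kind of filtering cheap.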


In [4]:
(train_data, test_data), info = tfds.load("imdb_reviews/subwords8k",
                                          split=(tfds.Split.TRAIN, tfds.Split.TEST),
                                          with_info=True,
                                          as_supervised=True,
                                          data_dir="D:/KeepStudy/0_Coding/0_dataset/tensorflow_datasets/")



In [5]:
encoder = info.features["text"].encoder
encoder

<SubwordTextEncoder vocab_size=8185>

In [6]:
# Shuffle and pad the data
train_batches = train_data.shuffle(1000).padded_batch(10, 
                                                      padded_shapes=((None, ), ())  # () corresponds to the scalar labels
                                                     )

test_batches = test_data.shuffle(1000).padded_batch(10,
                                                    padded_shapes=((None, ), ())
                                                   )

In [7]:
train_batch, train_labels = next(iter(train_batches))
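
To make the effect of `padded_batch` concrete, here is a minimal pure-Python sketch of the padding step, assuming the default padding value of 0 for integer sequences. The helper `pad_batch` and the token ids are hypothetical; the real work is done by `tf.data`.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical encoded reviews of different lengths.
batch = [[12, 7, 301], [45], [8, 8, 19, 2, 64]]
padded = pad_batch(batch)
for row in padded:
    print(row)
```

With `padded_shapes=((None,), ())`, the `None` dimension plays the role of `max_len` here: each batch is padded to its own longest review, while the scalar labels need no padding.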

<br>
<br>
<br>

## Keras Embedding Layer

<font style="color:maroon;font-size:110%">A [Keras Embedding Layer](https://keras.io/layers/embeddings/) can be used to train an embedding for each word in your vocabulary. Each word (or sub-word in this case) will be associated with a 16-dimensional vector (or embedding) that will be trained by the model.

See [this tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings?hl=en) to learn more about word embeddings.</font>
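
As a rough mental model, an embedding layer is a lookup table with one trainable row per token id, and average pooling collapses the looked-up rows into a single vector per review. The sketch below shows this with toy sizes in plain Python; `embedding_table`, `embed`, and `average_pool` are hypothetical names, and the pooling shown ignores masking.

```python
import random

vocab_size, embedding_dim = 6, 4  # toy sizes; the tutorial uses 8185 and 16
rng = random.Random(0)

# One row per token id, initialised with small random values.
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up one vector per token id."""
    return [embedding_table[i] for i in token_ids]


def average_pool(vectors):
    """Average over the sequence axis, giving one vector per review."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


tokens = [3, 1, 4, 1]
vectors = embed(tokens)
pooled = average_pool(vectors)
print(len(vectors), len(vectors[0]))  # sequence length, embedding_dim
print(len(pooled))                    # embedding_dim
```

Repeated token ids (the two `1`s above) map to the same row, which is why training adjusts a single shared vector per subword.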

In [8]:
# Create an embedding layer.
embedding_dim = 16
embedding = tf.keras.layers.Embedding(encoder.vocab_size, embedding_dim)


# Configure the embedding layer as part of a keras model.
model = tf.keras.Sequential([embedding,  # The embedding layer should be the first layer in a model.
                             tf.keras.layers.GlobalAveragePooling1D(),
                             tf.keras.layers.Dense(16, activation='relu'),
                             tf.keras.layers.Dense(1)])


# Compile model.
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'],)

In [9]:
# Train model for one epoch.
history = model.fit(train_batches, 
                    epochs=1, 
                    validation_data=test_batches, 
                    validation_steps=20)



<br>
<br>
<br>

## Saving Data for TensorBoard

<font style="color:maroon;font-size:110%">TensorBoard reads tensors and metadata from the logs of your TensorFlow projects.</font> The path to the log directory is specified with `log_dir` below. For this tutorial, we will be using `./logs/imdb-example/`.

<font style="color:maroon;font-size:110%">In order to load the data into TensorBoard, we need to save a training checkpoint to that directory, along with metadata that allows for visualization of a specific layer of interest in the model. </font>

In [10]:
# pip install chardet

import chardet

try:
    for subword in encoder.subwords[:10]:
        # chardet.detect() expects bytes, but the subwords are already str.
        print(chardet.detect(subword))
        # Raises:
        # TypeError: Expected object of type bytes or bytearray, got: <class 'str'>

except Exception as e:
    print(e)

Expected object of type bytes or bytearray, got: <class 'str'>


<br>

In [11]:
# Set up a logs directory, so Tensorboard knows where to look for files.
log_dir = './logs/imdb-example/'
if not os.path.exists(log_dir):
    os.makedirs(log_dir)

In [12]:
# help(encoder)
# help(open)

In [13]:
# Save the labels separately, one per line.
with open(os.path.join(log_dir, 'metadata.tsv'), 'w', encoding='utf-8') as f:
    for subwords in encoder.subwords:
        f.write("{}\n".format(subwords))
        
    # Fill in the rest of the labels with 'unknown'.
    for unknown in range(1, encoder.vocab_size - len(encoder.subwords)):
        f.write("unknown #{}\n".format(unknown))


# The encoding='utf-8' argument to open() is required here.
# Without it, writing fails on a system whose default encoding is GBK:
# UnicodeEncodeError: 'gbk' codec can't encode character '\x96' in position 1: illegal multibyte sequence
# 
# Reference: https://www.cnblogs.com/schut/p/10579955.html
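
The metadata file should contain one label per embedding row that is later saved. The following sketch checks that arithmetic on toy values: `subwords` and `vocab_size` are hypothetical stand-ins for `encoder.subwords` and `encoder.vocab_size`, and the file is written to a temporary directory rather than to `log_dir`.

```python
import os
import tempfile

# Hypothetical stand-ins for encoder.subwords and encoder.vocab_size.
subwords = ["the_", "movie_", "great_"]
vocab_size = 6  # larger than len(subwords) + 1, so some ids have no subword

with tempfile.TemporaryDirectory() as tmp_dir:
    metadata_path = os.path.join(tmp_dir, "metadata.tsv")
    with open(metadata_path, "w", encoding="utf-8") as f:
        for subword in subwords:
            f.write("{}\n".format(subword))
        # Same range as in the cell above: fills ids that have no subword.
        for unknown in range(1, vocab_size - len(subwords)):
            f.write("unknown #{}\n".format(unknown))

    with open(metadata_path, encoding="utf-8") as f:
        labels = f.read().splitlines()

# The embedding matrix has vocab_size rows; row 0 is dropped before saving.
num_embedding_rows = vocab_size - 1
print(labels)
print(len(labels) == num_embedding_rows)
```

The loop bounds give `len(subwords) + (vocab_size - len(subwords) - 1)` labels, which equals `vocab_size - 1`, the number of rows left once the row for index 0 is removed.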

In [14]:
# Save the weights we want to analyze as a variable.
# The first row represents any unknown word and has no entry in the metadata,
# so we remove it here.
weights = tf.Variable(model.layers[0].get_weights()[0][1:])

# Create a checkpoint from the embedding; the filename and key are the name of the tensor.
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

'./logs/imdb-example/embedding.ckpt-1'

In [15]:
# Set up config.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()

# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = "metadata.tsv"
projector.visualize_embeddings(log_dir, config)

In [16]:
# !ls -al ./logs/imdb-example/
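
On systems without `ls`, the log directory can be inspected from Python instead. The sketch below defines a hypothetical helper, `list_log_dir`, and demonstrates it on a temporary directory filled with placeholder files. The file names are an assumption about what the cells above typically leave in `log_dir`, not captured output.

```python
import os
import tempfile


def list_log_dir(log_dir):
    """Return (name, size in bytes) for each file directly inside log_dir, sorted by name."""
    entries = []
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


# Assumed file names (placeholders, not real TensorBoard output).
expected_names = [
    "checkpoint",
    "embedding.ckpt-1.data-00000-of-00001",
    "embedding.ckpt-1.index",
    "metadata.tsv",
    "projector_config.pbtxt",
]

with tempfile.TemporaryDirectory() as demo_dir:
    for name in expected_names:
        with open(os.path.join(demo_dir, name), "wb") as f:
            f.write(b"placeholder")
    listing = list_log_dir(demo_dir)

for name, size in listing:
    print("{:>8d}  {}".format(size, name))
```

In the notebook itself, calling `list_log_dir(log_dir)` after the cells above would show whether the checkpoint, metadata, and projector config files were all written.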

In [19]:
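# Now run TensorBoard against the log data we just saved.
# IPython expands {log_dir} to the value of the Python variable;
# a bare `log_dir` would be read as a literal directory name.

%tensorboard --logdir {log_dir}

# Equivalent, with the path written out:
# %tensorboard --logdir logs/imdb-example/
# %tensorboard --logdir=logs/imdb-example/
# The dashboard can also be opened in a browser at http://localhost:6006

# Or, to use a different port:
# %tensorboard --logdir=logs/imdb-example/ --port=8008
# The dashboard can then be opened in a browser at http://localhost:8008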

Reusing TensorBoard on port 6006 (pid 5820), started 0:27:41 ago. (Use '!kill 5820' to kill it.)

<br>
<br>
<br>

## Analysis
The TensorBoard Projector is a great tool for interpreting and visualizing embeddings. The dashboard allows users to search for specific terms, and highlights words that are adjacent to each other in the embedding (low-dimensional) space. From this example we can see that Wes **Anderson** and Alfred **Hitchcock** are both rather neutral terms, but that they are referenced in different contexts.

<!-- <img class="tfo-display-only-on-site" src="images/embedding_projector_hitchcock.png?raw=1"/> -->

In this space, Hitchcock is closer to words like `nightmare`, which is likely due to the fact that he is known as the "Master of Suspense", whereas Anderson is closer to the word `heart`, which is consistent with his relentlessly detailed and heartwarming style.
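
The notion of "closer" can be made concrete with a nearest-neighbour search. Below is a minimal sketch using cosine similarity on hand-made 3-dimensional vectors. The numbers are invented purely to mirror the description above and are not taken from the trained model; `cosine_similarity` and `nearest` are hypothetical helpers.

```python
import math

# Hypothetical toy embeddings; real rows would come from the trained layer.
embeddings = {
    "hitchcock": [0.9, 0.1, 0.0],
    "nightmare": [0.8, 0.2, 0.1],
    "anderson": [0.1, 0.9, 0.0],
    "heart": [0.2, 0.8, 0.1],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, k=1):
    """Return the k words most similar to `word`, most similar first."""
    query = embeddings[word]
    scored = [
        (cosine_similarity(query, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


print(nearest("hitchcock"))
print(nearest("anderson"))
```

With real weights, the same search could be run over the rows saved to the checkpoint, using the lines of `metadata.tsv` as the labels.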

<!-- <img class="tfo-display-only-on-site" src="images/embedding_projector_anderson.png?raw=1"/> -->

<br>
<br>
<br>

<font style="color:maroon;font-size:110%">

<br>

**Copyright 2019 The TensorFlow Authors.**

```python
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

<br>
<br>
<br>