# Visualizing Data using the Embedding Projector in TensorBoard

## Overview

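With the TensorBoard Embedding Projector, you can graphically represent high-dimensional embeddings. This helps you visualize, examine, and understand your embedding layers.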

## Setup

In this tutorial, we will use TensorBoard to visualize an embedding layer generated for classifying movie review data.

In [1]:
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

%load_ext tensorboard

In [2]:
!pip install tensorflow_datasets



In [3]:
import os
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorboard.plugins import projector



## IMDB Data

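We will use a dataset of 25,000 IMDB movie reviews, each of which has a sentiment label (positive/negative). Each review is preprocessed and encoded as a sequence of word indexes (integers). For simplicity, words are indexed by their overall frequency in the dataset, so that, for example, the integer "3" encodes the 3rd most frequent word across all reviews. This allows for quick filtering operations such as: "only consider the top 10,000 most common words, but eliminate the top 20 most common words".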

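By convention, "0" does not stand for any specific word, but is instead used to encode any unknown word. Later in the tutorial, we will remove the row for "0" from the visualization.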
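
The frequency-based filtering described above can be sketched in plain Python. This is a minimal illustration, not part of the tutorial's pipeline: the helper `filter_by_rank`, its parameters, and the sample review are all hypothetical, and out-of-range indices are mapped to the unknown index "0".

```python
def filter_by_rank(indices, skip_top=20, num_words=10_000, unknown=0):
    """Keep indices whose frequency rank lies in (skip_top, num_words].

    Everything else (the most common words, rare words, and unknowns)
    is replaced by the `unknown` index.
    """
    return [i if skip_top < i <= num_words else unknown for i in indices]


# Hypothetical review encoded as frequency-ranked word indices.
review = [3, 14, 250, 9999, 10_001, 42, 0]
filtered = filter_by_rank(review)
print(filtered)  # [0, 0, 250, 9999, 0, 42, 0]
```

Because the index is the frequency rank, the filter needs only a range comparison per token and no vocabulary lookup.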

In [4]:
(train_data, test_data), info = tfds.load(
    "imdb_reviews/subwords8k",
    split=(tfds.Split.TRAIN, tfds.Split.TEST),
    with_info=True,
    as_supervised=True,
)
encoder = info.features["text"].encoder

train_batches = train_data.shuffle(1000).padded_batch(
    10, padded_shapes=((None,), ())
)
test_batches = test_data.shuffle(1000).padded_batch(
    10, padded_shapes=((None,), ())
)
train_batch, train_labels = next(iter(train_batches))




## Keras Embedding Layer

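A Keras Embedding layer can be used to train an embedding for each word in your vocabulary. Each word (or sub-word in this case) will be associated with a 16-dimensional vector (or embedding) that will be trained by the model.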
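
Conceptually, an embedding layer is a lookup table with one row of floats per token index, and average pooling collapses a variable-length sequence of those rows into a single fixed-size vector. The sketch below illustrates this in plain Python under toy assumptions: the table sizes, the random initial values, and the helper names `embed` and `global_average_pool` are hypothetical and are not the Keras implementation.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes; the tutorial uses 16 dimensions

# One row of floats per token index; training would adjust these values.
table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up one vector per token index."""
    return [table[t] for t in token_ids]


def global_average_pool(vectors):
    """Average over the sequence axis, giving one fixed-size vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


sequence = [2, 5, 1]  # hypothetical encoded review
vectors = embed(sequence)
pooled = global_average_pool(vectors)
print(len(vectors), len(vectors[0]))  # 3 4
print(len(pooled))  # 4
```

The pooled vector has the same length regardless of how many tokens the review contains, which is what lets the dense layers that follow operate on reviews of any length.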

In [5]:
embedding_dim = 16
embedding = tf.keras.layers.Embedding(encoder.vocab_size, embedding_dim)
model = tf.keras.Sequential(
    [
        embedding,
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1),
    ]
)

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

history = model.fit(
    train_batches, epochs=1, validation_data=test_batches, validation_steps=20
)



## Saving Data for TensorBoard

TensorBoard reads tensors and metadata from the logs of your project. The path to the log directory is specified with `log_dir` below. For this tutorial, we will be using `./logs/imdb-example/`.

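In order to load the data into TensorBoard, we need to save a training checkpoint to that directory, along with metadata that allows for visualization of a specific layer of interest in the model.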

In [7]:
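# Set up a logs directory so TensorBoard knows where to look for files.
log_dir = "./logs/imdb-example/"
os.makedirs(log_dir, exist_ok=True)

# Save the labels, one per line, in the same order as the embedding rows.
with open(os.path.join(log_dir, "metadata.tsv"), "w") as f:
    for subwords in encoder.subwords:
        f.write("{}\n".format(subwords))
    # Fill in the rest of the labels with "unknown".
    for unknown in range(1, encoder.vocab_size - len(encoder.subwords)):
        f.write("unknown #{}\n".format(unknown))

# Save the weights we want to analyze as a variable. The first row stands
# for the "0" index, which has no label in the metadata, so it is dropped.
weights = tf.Variable(model.layers[0].get_weights()[0][1:])
# The keyword used here ("embedding") becomes part of the tensor name below.
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

# Point the projector at the saved tensor and its metadata file.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# The checkpointed variable's name is suffixed with
# "/.ATTRIBUTES/VARIABLE_VALUE".
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = "metadata.tsv"
projector.visualize_embeddings(log_dir, config)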
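
The cell above relies on the metadata file having one label per saved embedding row, in the same order. The following sketch checks that alignment in plain Python under toy assumptions: the helper `write_metadata`, the three-entry vocabulary, and the size `vocab_size = 6` are hypothetical stand-ins for the real encoder.

```python
import os
import tempfile


def write_metadata(path, subwords, vocab_size):
    """Write one label per line: known subwords first, then placeholders."""
    with open(path, "w", encoding="utf-8") as f:
        for subword in subwords:
            f.write(f"{subword}\n")
        for unknown in range(1, vocab_size - len(subwords)):
            f.write(f"unknown #{unknown}\n")


# Hypothetical toy vocabulary: index 0 is reserved, 3 subwords, 2 extra slots.
subwords = ["the_", "movie_", "great"]
vocab_size = 6
num_embedding_rows = vocab_size - 1  # rows left after dropping index 0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.tsv")
    write_metadata(path, subwords, vocab_size)
    with open(path, encoding="utf-8") as f:
        labels = f.read().splitlines()

print(len(labels) == num_embedding_rows)  # True
```

Starting the placeholder range at 1 produces exactly `vocab_size - 1` labels, matching the weight matrix after its first row is sliced off with `[1:]`.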

In [12]:
%tensorboard --logdir ./logs/imdb-example/


## Analysis

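The TensorBoard Projector is a great tool for interpreting and visualizing embeddings. The dashboard allows users to search for specific terms, and highlights words that are adjacent to each other in the embedding (low-dimensional) space. From this example we can see that Wes Anderson and Alfred Hitchcock are both rather neutral terms, but that they are referenced in different contexts.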

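In this space, `Hitchcock` is closer to words like `nightmare`, which is likely because he is known as the "Master of Suspense", whereas `Anderson` is closer to the word `heart`, which is consistent with his relentlessly detailed and heartwarming style.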
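
The kind of neighbor lookup the dashboard performs can be approximated offline with a distance metric such as cosine similarity. The sketch below is illustrative only: the 3-dimensional vectors are invented and do not come from the trained model, and the helpers `cosine_similarity` and `nearest_neighbors` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, embeddings, k=2):
    """Return the k words most similar to `query`, most similar first."""
    scores = [
        (cosine_similarity(embeddings[query], vector), word)
        for word, vector in embeddings.items()
        if word != query
    ]
    return [word for _, word in sorted(scores, reverse=True)[:k]]


# Hypothetical 3-dimensional vectors, invented for illustration only.
toy = {
    "Hitchcock": [0.9, 0.1, 0.0],
    "nightmare": [0.8, 0.2, 0.1],
    "Anderson": [0.1, 0.9, 0.0],
    "heart": [0.2, 0.8, 0.1],
}
print(nearest_neighbors("Hitchcock", toy, k=1))  # ['nightmare']
print(nearest_neighbors("Anderson", toy, k=1))  # ['heart']
```

With real data, the same functions could be applied to the rows of the saved weight matrix paired with the labels from `metadata.tsv`.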