**PART 3**

**ADVANCED DEEP NETWORKS FOR COMPLEX PROBLEMS**

---

**CHAPTER 14 - TensorBoard: Big brother of TensorFlow**

---

### **14.1 Visualize data with TensorBoard**

TensorBoard is a visualization toolkit that ships with TensorFlow. It lets practitioners visualize high-dimensional data (such as images and text), track model metrics (like loss and accuracy) in real time, and profile models to identify performance bottlenecks. It works by reading event files from a designated logging directory and displaying them on a web-based dashboard.

**Scenario**: As a data scientist at a fashion company, you are tasked with building a model to identify clothing items. You select the **Fashion-MNIST** dataset for this task, which consists of grayscale images of 10 clothing categories. Before training, inspecting the data is crucial to ensure it is loaded correctly and that labels match the images. Visualizing raw data helps verify the integrity of the input pipeline.

**Data Loading**: We download the Fashion-MNIST dataset using `tensorflow_datasets` and process it into training, validation, and test sets. We create a pipeline that shuffles the data and batches it.

**Logging Images**: To visualize images, we use the `tf.summary.image()` function. This requires a summary writer, created with `tf.summary.create_file_writer()`, that points to a specific log directory. We typically include a timestamp in the directory name to distinguish between different runs. The writer is used as a context manager (`with image_writer.as_default():`), and every `tf.summary.image()` call inside that block is written to its log directory.
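The logging step can be sketched as follows. This is a minimal illustration, not the chapter's pipeline: random tensors stand in for Fashion-MNIST images, and the names `fake_images`, `demo_logdir`, and `demo_writer` as well as the temporary directory are assumptions made for the example.

```python
import glob
import os
import tempfile

import tensorflow as tf

# Hypothetical stand-in for a batch of ten 28x28 grayscale images in [0, 1].
fake_images = tf.random.uniform(shape=(10, 28, 28, 1), minval=0.0, maxval=1.0)

# Illustrative log directory; a real run would use a timestamped path under ./logs.
demo_logdir = os.path.join(tempfile.mkdtemp(), "data_demo", "train")
demo_writer = tf.summary.create_file_writer(demo_logdir)

with demo_writer.as_default():
    # tf.summary.image expects a rank-4 tensor: [batch, height, width, channels].
    # max_outputs caps how many images from the batch are stored for this step.
    tf.summary.image("sample_batch", fake_images, step=0, max_outputs=5)
    demo_writer.flush()

# The writer stores its data in event files inside the log directory.
event_files = glob.glob(os.path.join(demo_logdir, "events.out.tfevents.*"))
print(len(event_files))
```

Pointing TensorBoard at the parent of `demo_logdir` should then show the images under the Images tab.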

![Figure 14.1 Jupyter magic commands in a notebook cell](./14.Chapter-14/Figure14-1.jpg)
![Figure 14.2 The TensorBoard visualizing logged images, displayed inline in the Jupyter notebook](./14.Chapter-14/Figure14-2.jpg)

In [None]:
import tensorflow as tf
from datetime import datetime
import tensorflow_datasets as tfds

# Define log directory with timestamp format YYYYMMDDHHMMSS
log_datetimestamp_format = "%Y%m%d%H%M%S"
log_datetimestamp = datetime.now().strftime(log_datetimestamp_format)
image_logdir = "./logs/data_{}/train".format(log_datetimestamp)

# Create file writer
image_writer = tf.summary.create_file_writer(image_logdir)

# Load the dataset and map label ids to class names
fashion_ds = tfds.load("fashion_mnist")
id2label_map = {
    0: "T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat",
    5: "Sandal", 6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot",
}

# Log ten training images, each under a tag named after its class
with image_writer.as_default():
    for data in fashion_ds["train"].batch(1).take(10):
        label_id = int(data["label"].numpy()[0])
        tf.summary.image(id2label_map[label_id], data["image"], step=0)

### **14.2 Tracking and monitoring models with TensorBoard**

A primary use of TensorBoard is monitoring the training progress of deep learning models. Deep networks often take a long time to train, and waiting until the end to discover a failure is inefficient. By visualizing metrics like loss and accuracy in real time, you can quickly spot a model that is failing to converge or is overfitting, stop training early, and save time.

We compare two models on the Fashion-MNIST dataset:
1.  **Fully Connected Network (Dense)**: A simple Multilayer Perceptron with dense layers.
2.  **Convolutional Neural Network (CNN)**: A network using Conv2D and Pooling layers, which is generally better suited for image tasks as it captures spatial hierarchies.
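The two architectures might look like the sketch below. The layer counts, unit sizes, and the helper names `build_dense_model` and `build_conv_model` are assumptions chosen for illustration; the text only states that one model uses dense layers and the other uses Conv2D and pooling layers.

```python
import tensorflow as tf


def build_dense_model():
    """Fully connected baseline: flatten the image, then stack dense layers."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])


def build_conv_model():
    """Small CNN: convolutions and pooling keep the 2D structure until the end."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(16, (5, 5), strides=(2, 2), padding="same", activation="relu"),
        tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        tf.keras.layers.MaxPool2D(pool_size=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])


dense_model = build_dense_model()
conv_model = build_conv_model()

# Both models map a batch of 28x28x1 images to 10 class probabilities.
dummy_batch = tf.zeros((2, 28, 28, 1))
print(dense_model(dummy_batch).shape, conv_model(dummy_batch).shape)
```

Both models accept the same input shape, so the same datasets and the same TensorBoard callback setup can be reused for each.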

**Organizing Runs**: To compare these models effectively, we save their logs to separate subdirectories (e.g., `./logs/dense/run_1` vs `./logs/conv/run_1`). TensorBoard treats each subdirectory as a separate "run" and allows you to toggle them on and off for comparison. A robust naming convention usually includes the model type and a timestamp.
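One possible naming helper is sketched here. The function name `make_run_dir`, the `run_` prefix, and the `logs` root are illustrative assumptions; any scheme that gives each run its own subdirectory works.

```python
import os
from datetime import datetime


def make_run_dir(model_type, root="logs", now=None):
    """Build a per-run log path such as logs/dense/run_20240131120000."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d%H%M%S")
    return os.path.join(root, model_type, "run_{}".format(stamp))


# A fixed time is passed here only to make the example reproducible.
fixed_time = datetime(2024, 1, 31, 12, 0, 0)
dense_dir = make_run_dir("dense", now=fixed_time)
conv_dir = make_run_dir("conv", now=fixed_time)
print(dense_dir)
print(conv_dir)
```

Starting TensorBoard with `tensorboard --logdir logs` then lists both runs, since each lives in its own subdirectory of the common root.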

**The TensorBoard Callback**: In Keras, we use the `tf.keras.callbacks.TensorBoard` callback. Important arguments include:
* `log_dir`: The path where log files will be written.
* `histogram_freq`: How often (in epochs) to compute activation and weight histograms. This helps analyze the distribution of values within layers. A value of 0 disables histogram logging.
* `write_graph`: Whether to visualize the model's computation graph (defaults to True).
* `profile_batch`: Which batches to capture for performance profiling (discussed later).

![Figure 14.3 How tracked metrics are displayed on the TensorBoard](./14.Chapter-14/Figure14-3.jpg)

The scalar dashboard plots metrics over time. You can use the smoothing slider to reduce noise and make the overall trend easier to see. You can also toggle between linear and log scales for the y-axis.
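To build intuition for what the smoothing slider does, here is a small sketch of a debiased exponential moving average. It approximates the idea and is not taken from TensorBoard's source; the function name `smooth` and the sample values are made up for the example.

```python
def smooth(values, weight):
    """Debiased exponential moving average; weight in [0, 1), higher is smoother."""
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values, start=1):
        running = weight * running + (1.0 - weight) * value
        # Dividing by (1 - weight**i) keeps early points from being pulled toward zero.
        smoothed.append(running / (1.0 - weight ** i))
    return smoothed


noisy_loss = [1.0, 0.4, 0.9, 0.3, 0.8, 0.2]
print(smooth(noisy_loss, 0.0))  # weight 0 leaves the series unchanged
print(smooth(noisy_loss, 0.6))  # a higher weight damps the oscillations
```

A weight of 0 reproduces the raw curve, while values closer to 1 trade responsiveness for a smoother line.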

![Figure 14.4 How the smoothing parameter changes the line plot](./14.Chapter-14/Figure14-4.jpg)
![Figure 14.5 Viewing metrics of both the dense model and the convolutional model](./14.Chapter-14/Figure14-5.jpg)

**Activation Histograms**: TensorBoard can display histograms of the weights and activations of your model layers over time. This helps diagnose issues such as vanishing gradients (values collapsing toward zero) or exploding gradients (values growing very large). Histograms from different epochs are stacked, with older epochs drawn further back in darker shades and more recent epochs in front in lighter shades.

![Figure 14.6 Activation histograms displayed by the TensorBoard](./14.Chapter-14/Figure14-6.jpg)

In [None]:
# Define TensorBoard callback for the Dense model
dense_log_dir = "logs/dense_{}".format(log_datetimestamp)
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir=dense_log_dir,
    histogram_freq=1,  # Log histograms every epoch to visualize weight distributions
    profile_batch=0    # Disable profiling for this specific run
)

# Model training would look like this:
# dense_model.fit(tr_ds, validation_data=v_ds, epochs=10, callbacks=[tb_callback])

### **14.3 Using tf.summary to write custom metrics during model training**

While Keras provides standard metrics (accuracy, loss), research often requires tracking custom values that are not part of the standard set. For instance, you might want to analyze the stability of your layer weights by tracking their mean and standard deviation during training to see the effect of Batch Normalization.

Since these aren't standard Keras metrics that can be passed to `model.compile`, we implement a **custom training loop**. This gives us granular control over the training process and logging.

1.  **Define Models**: We define two versions of the model: one with `BatchNormalization` layers and one without, to compare their weight statistics.
2.  **Custom Loop**: We use `tf.summary.create_file_writer` to instantiate a writer. Inside the training loop (iterating over epochs and batches), we manually calculate the metrics.
3.  **Log Scalars**: We use `tf.summary.scalar()` to log the calculated mean and standard deviation of the weights at specific steps. Calling `writer.flush()` forces buffered events to be written to disk, so new points show up in TensorBoard without waiting for the buffer to fill.

The visualization in TensorBoard allows us to see that with Batch Normalization, the mean and standard deviation of weights vary much more compared to when it is not used, providing insight into how normalization affects internal layer statistics.

![Figure 14.8 The mean and standard deviation of weights plotted in the TensorBoard](./14.Chapter-14/Figure14-8.jpg)

In [None]:
import numpy as np

def train_model(model, dataset, log_dir, log_layer_name, epochs):
    """Train `model` and log weight statistics of one layer at every step.

    Assumes `model` is already compiled and `dataset` yields (inputs, labels) batches.
    """
    # Create a summary writer for the custom log directory
    writer = tf.summary.create_file_writer(log_dir)
    step = 0

    # Inside this context, tf.summary calls write to `writer`
    with writer.as_default():
        for e in range(epochs):
            for x, y in dataset:
                # One optimization step on the current batch
                model.train_on_batch(x, y)

                # Extract the kernel (first weight array) of the layer to analyze
                weights = model.get_layer(log_layer_name).get_weights()[0]

                # Log mean and standard deviation of the absolute weight values
                tf.summary.scalar("mean_weights", np.mean(np.abs(weights)), step=step)
                tf.summary.scalar("std_weights", np.std(np.abs(weights)), step=step)

                # Push buffered events to disk so TensorBoard can display them
                writer.flush()
                step += 1

### **14.4 Profiling models to detect performance bottlenecks**

TensorBoard Profiler is a powerful tool to analyze where your model spends its time during execution (e.g., is it waiting for data? is the GPU idle? is the kernel launch slow?). Identifying these bottlenecks is the first step toward optimization.

**Prerequisites**: You need to install the `tensorboard_plugin_profile` package and the NVIDIA CUDA Profiling Tools Interface (CUPTI, provided by the `libcupti` library). On Windows, this often requires specific configuration, such as copying DLLs to the CUDA bin folder and enabling GPU performance counters for all users in the NVIDIA Control Panel.

**Running the Profiler**: We enable profiling by passing the `profile_batch` argument to the TensorBoard callback. For example, `profile_batch=[10, 20]` tells TensorFlow to profile batches 10 through 20. TensorBoard generates a report with a performance summary, step-time graph, and specific recommendations.

![Figure 14.10 TensorBoard profiling interface](./14.Chapter-14/Figure14-10.jpg)

#### **Optimizing the input pipeline**
If the profiler shows high "Input Time," the GPU is starving for data because the CPU cannot prepare batches fast enough. Two changes to the `tf.data` pipeline address this, and a third setting targets kernel launch delays:
1.  **Parallel Mapping**: Use `num_parallel_calls=tf.data.AUTOTUNE` in the `map` function. This tells TensorFlow to use multiple threads to process data transformations in parallel.
2.  **Prefetching**: Add `.prefetch(tf.data.AUTOTUNE)` at the end of the pipeline. This allows the CPU to prepare the next batch of data while the GPU is still processing the current one, overlapping preprocessing and model execution.
3.  **Kernel Launch Latency**: If the CPU is too busy to launch GPU kernels in time (high Kernel Launch time), we can set the environment variable `TF_GPU_THREAD_MODE=gpu_private`. This dedicates specific threads solely to launching GPU kernels, preventing them from being blocked by other CPU tasks.
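The three optimizations above can be combined in one short script. This is a sketch under stated assumptions: synthetic tensors replace the real dataset, and `scale_image` is a hypothetical preprocessing function standing in for whatever transformation the pipeline applies.

```python
import os

# Set before TensorFlow is imported so the runtime sees it when it initializes the GPU.
os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"

import tensorflow as tf


def scale_image(image, label):
    # Hypothetical preprocessing: cast to float32 and scale pixels to [0, 1].
    return tf.cast(image, tf.float32) / 255.0, label


# Synthetic stand-in data: 64 integer-valued "images" with class labels.
images = tf.random.uniform((64, 28, 28, 1), maxval=256, dtype=tf.int32)
labels = tf.random.uniform((64,), maxval=10, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    # Run the preprocessing function on several elements in parallel.
    .map(scale_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(16)
    # Prepare upcoming batches while the current one is being consumed.
    .prefetch(tf.data.AUTOTUNE)
)

first_images, first_labels = next(iter(dataset))
print(first_images.shape, first_images.dtype)
```

Placing `prefetch` last means whole, ready-to-use batches are buffered. The environment variable can equally be exported in the shell before launching the training script.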

![Figure 14.11 Side-by-side comparison of the profiling overview with and without data- and model-related optimizations](./14.Chapter-14/Figure14-11.jpg)

#### **Mixed precision training**
Standard training uses 32-bit floating-point numbers (`float32`), which take more memory and run slower than 16-bit floats on modern GPUs equipped with Tensor Cores. **Mixed Precision Training** performs most computations (such as matrix multiplications) in 16-bit floats (`float16`) while keeping variables (weights) in `float32` for numerical stability. This speeds up the math significantly and reduces memory usage, often allowing larger batch sizes.

TensorBoard's memory profile view allows verifying the reduction in memory consumption when switching to mixed precision.
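The trade-off behind mixed precision can be illustrated with plain NumPy, independent of any GPU. The array size and the update value below are arbitrary choices for the example.

```python
import numpy as np

# The same matrix stored in float16 needs half the bytes of its float32 version.
weights32 = np.ones((1024, 1024), dtype=np.float32)
weights16 = weights32.astype(np.float16)
print(weights32.nbytes // weights16.nbytes)

# A small weight update survives in float32 but is rounded away in float16,
# which is why the variables themselves are kept in float32.
small_update = 1e-4
kept32 = np.float32(1.0) + np.float32(small_update)
lost16 = np.float16(1.0) + np.float16(small_update)
print(kept32 > 1.0, lost16 > 1.0)
```

Near 1.0, adjacent `float16` values are roughly 0.001 apart, so an update of 0.0001 disappears entirely when added in that precision.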

![Figure 14.12 Memory profile with and without the optimizations](./14.Chapter-14/Figure14-12.jpg)

In [None]:
from tensorflow.keras import mixed_precision

# Enable mixed precision training globally
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

# Optimized Data Pipeline Example
# dataset = dataset.map(get_image_and_label, num_parallel_calls=tf.data.AUTOTUNE)
# dataset = dataset.prefetch(tf.data.AUTOTUNE)

### **14.5 Visualizing word vectors with the TensorBoard**

Visualizing embeddings is crucial in NLP to verify if the model has learned semantic relationships (e.g., ensuring "cat" is closer to "dog" than to "car"). TensorBoard's **Embedding Projector** allows us to visualize high-dimensional vectors in 2D or 3D space.

**Process**:
1.  **Load Vectors**: We load pretrained word vectors (like **GloVe**) into a pandas DataFrame. These vectors capture semantic meaning based on co-occurrence statistics.
2.  **Save Checkpoint**: The embeddings must be saved as a `tf.Variable` within a TensorFlow checkpoint file (`.ckpt`). This is the format TensorBoard reads.
3.  **Save Metadata**: To label the points in the visualization, we save a tab-separated value (TSV) file containing the words corresponding to each vector row.
4.  **Configure Projector**: We use `tensorboard.plugins.projector` to write a configuration file (`projector_config.pbtxt`) that links the checkpoint to the metadata file.
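The loading step (step 1) can be sketched with pandas as follows. The three vectors are made up and only mimic the GloVe text layout of one word followed by its values per line; a real GloVe file would be passed by path instead of through `io.StringIO`.

```python
import csv
import io

import pandas as pd

# Made-up 4-dimensional vectors in GloVe's text layout: "word v1 v2 ... vn".
sample_glove_text = (
    "cat 0.1 0.2 0.3 0.4\n"
    "dog 0.1 0.25 0.3 0.35\n"
    "car -0.5 0.9 0.0 0.2\n"
)

embeddings_df = pd.read_csv(
    io.StringIO(sample_glove_text),
    sep=" ",
    header=None,              # the file has no header row
    index_col=0,              # the first column holds the word
    quoting=csv.QUOTE_NONE,   # treat quote characters as ordinary tokens
    keep_default_na=False,    # keep words such as "null" or "nan" as strings
)

print(embeddings_df.shape)
print(list(embeddings_df.index))
```

The resulting DataFrame has one row per word, with the word as the index. The index supplies the labels for the metadata file and `embeddings_df.values` supplies the matrix stored in the checkpoint.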

TensorBoard provides algorithms like **PCA** (Principal Component Analysis), **t-SNE**, and **UMAP** to project these high-dimensional embeddings into 2D or 3D space. The interface allows searching for specific words using regular expressions to highlight clusters and verify semantic relationships.
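To make the projection idea concrete, here is a minimal PCA sketch in NumPy. It is not the Embedding Projector's implementation; the function name `pca_project` and the random toy vectors are assumptions for illustration.

```python
import numpy as np


def pca_project(vectors, n_components=2):
    """Project row vectors onto their top principal components using SVD."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(20, 50))  # 20 "words", 50 dimensions each
points_2d = pca_project(toy_embeddings)
print(points_2d.shape)
```

Each row of `points_2d` is the 2D position of one vector. PCA is a linear projection, whereas t-SNE and UMAP are nonlinear methods that focus on preserving local neighborhoods.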

![Figure 14.13 The word vector view on the TensorBoard](./14.Chapter-14/Figure14-13.jpg)
![Figure 14.14 Searching words in the visualizations](./14.Chapter-14/Figure14-14.jpg)

In [None]:
from tensorboard.plugins import projector
import os

def visualize_embeddings(log_dir, embeddings_df):
    # 1. Create Variable and Save Checkpoint
    # Convert dataframe values to a TensorFlow variable
    weights = tf.Variable(embeddings_df.values)
    checkpoint = tf.train.Checkpoint(embedding=weights)
    checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

    # 2. Save Metadata (Labels)
    # Write the words (index of dataframe) to a TSV file, one per line,
    # in the same order as the rows of the embedding matrix
    with open(os.path.join(log_dir, 'metadata.tsv'), 'w', encoding='utf-8') as f:
        for w in embeddings_df.index:
            f.write(str(w) + '\n')

    # 3. Configure Projector
    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    # The name combines the attribute passed to tf.train.Checkpoint ("embedding")
    # with the suffix TensorFlow appends to saved variables
    embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
    embedding.metadata_path = 'metadata.tsv'
    
    # Write the config file for TensorBoard
    projector.visualize_embeddings(log_dir, config)