# Chapter 14: TensorBoard — Big Brother of TensorFlow

## 1️⃣ Chapter Overview

Deep learning models are often described as **black boxes** due to their high-dimensional parameter spaces and non-linear decision boundaries, which make internal behavior difficult to interpret. In addition, training deep neural networks is computationally expensive and susceptible to silent failures such as vanishing gradients, dead neurons, or inefficient data pipelines.

This chapter introduces **TensorBoard**, TensorFlow’s built-in visualization and diagnostics toolkit. TensorBoard provides a systematic way to observe, debug, and optimize deep learning models by exposing internal states and training dynamics that would otherwise remain hidden.

Through TensorBoard, practitioners can monitor training metrics in real time, inspect parameter distributions, visualize learned embeddings, and profile system-level performance bottlenecks. As a result, TensorBoard plays a crucial role in both model development and experimental reproducibility.

---


## 2️⃣ Theoretical Explanation

### 2.1 How TensorBoard Works

TensorBoard operates by reading **event log files** generated during model execution rather than directly inspecting a running program. These logs capture time-stamped summaries of tensors, metrics, and metadata.

The workflow consists of three main components:
1. **Summary Writer:** A TensorFlow object that writes events (scalars, histograms, images, embeddings) to disk.
2. **Event Files:** Binary log files that store serialized summary data produced during training or evaluation.
3. **TensorBoard Server:** A separate process that monitors the log directory, parses event files, and renders interactive visualizations in a web interface.

Because the server only reads files from disk, this decoupled design scales to long-running experiments and large models with little interference in training. The main cost on the training side is writing the summaries themselves.
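The decoupling described above can be illustrated with a toy analogue. This is only a sketch: `ToySummaryWriter` and `read_scalars` are hypothetical names invented for illustration, and the JSON-lines file stands in for TensorBoard's real binary event files, which it does not reproduce.

```python
import json
import os
import tempfile
import time


class ToySummaryWriter:
    """Toy stand-in for a summary writer: appends one JSON record per line.

    Real TensorBoard event files are binary; JSON lines are used here only
    to keep the sketch readable.
    """

    def __init__(self, logdir):
        os.makedirs(logdir, exist_ok=True)
        self.path = os.path.join(logdir, "events.jsonl")

    def scalar(self, tag, value, step):
        record = {"wall_time": time.time(), "step": step, "tag": tag, "value": value}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


def read_scalars(logdir, tag):
    """Toy stand-in for the server side: parse the log and return (step, value) pairs."""
    path = os.path.join(logdir, "events.jsonl")
    points = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if record["tag"] == tag:
                points.append((record["step"], record["value"]))
    return points


# "Training" side: only writes to disk.
demo_logdir = tempfile.mkdtemp()
writer = ToySummaryWriter(demo_logdir)
for step, loss in enumerate([0.9, 0.6, 0.4]):
    writer.scalar("loss", loss, step)

# "Viewer" side: only reads from disk and knows nothing about the training code.
print(read_scalars(demo_logdir, "loss"))
```

The two sides share nothing but a directory on disk, which is why a viewer can be started, stopped, or pointed at old runs without touching the training process.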

---


### 2.2 Monitoring Training with Scalars

The most common TensorBoard visualization is the **Scalar Dashboard**, which tracks scalar values as functions of training steps or epochs.

Typical scalar metrics include:
* Training and validation loss
* Training and validation accuracy
* Learning rate schedules

Formally, a scalar metric can be expressed as a function:

$$ s(t) = f(\theta_t, D) $$

where $t$ denotes training time (step or epoch), $\theta_t$ represents model parameters at time $t$, and $D$ is the dataset. Visualizing these curves enables early detection of **overfitting**, **underfitting**, and unstable optimization behavior.
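As a rough illustration of what "early detection of overfitting" means on these curves, the sketch below scans per-epoch loss values for the point where validation loss stops improving while training loss keeps falling. The function name `find_overfitting_onset` and the `patience` parameter are assumptions made for this example; this is one simple heuristic, not something TensorBoard computes for you.

```python
def find_overfitting_onset(train_loss, val_loss, patience=2):
    """Return the epoch of the best validation loss if, afterwards, validation
    loss failed to improve for `patience` consecutive epochs while training
    loss kept falling. Return None if no such pattern appears."""
    best_epoch = 0
    stale = 0  # consecutive epochs with falling train loss but no val improvement
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] < val_loss[best_epoch]:
            best_epoch = epoch
            stale = 0
        elif train_loss[epoch] < train_loss[epoch - 1]:
            stale += 1
            if stale >= patience:
                return best_epoch
        else:
            stale = 0
    return None


# Illustrative (made-up) curves: validation loss bottoms out at epoch 2.
train_curve = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_curve = [1.1, 0.9, 0.8, 0.85, 0.9, 0.95]
print(find_overfitting_onset(train_curve, val_curve))
```

On the Scalar Dashboard the same pattern shows up visually as the validation curve turning upward while the training curve continues to fall.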

---


### 2.3 Histograms and Weight Distributions

Histogram visualizations display the distribution of tensor values (such as weights, biases, or gradients) over time. These plots are essential for diagnosing numerical pathologies in deep networks.

Key failure modes detectable via histograms include:
* **Vanishing gradients:** Gradient values concentrate near zero, so parameter updates become negligible.
* **Exploding gradients:** Gradient magnitudes, and with them parameter magnitudes, grow uncontrollably.
* **Dead ReLU units:** Activations remain zero across training iterations.

By observing how distributions evolve, practitioners can adjust initialization schemes, activation functions, or optimization hyperparameters to stabilize training.
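The failure modes listed above can also be summarized numerically. The following is a minimal NumPy sketch, assuming you already have a matrix of post-ReLU activations (examples by units) or an array of gradient values. The helper names `dead_unit_fraction` and `near_zero_fraction` and the tolerance are hypothetical choices for illustration.

```python
import numpy as np


def dead_unit_fraction(activations):
    """Fraction of units whose post-ReLU activation is zero for every example.

    `activations` has shape (num_examples, num_units).
    """
    always_zero = np.all(activations == 0, axis=0)
    return float(np.mean(always_zero))


def near_zero_fraction(values, tol=1e-7):
    """Fraction of entries with magnitude below `tol` (for example, gradient entries)."""
    return float(np.mean(np.abs(values) < tol))


rng = np.random.default_rng(0)
pre_activations = rng.normal(size=(256, 8))
pre_activations[:, :2] = -5.0                    # force two units to stay negative
activations = np.maximum(pre_activations, 0.0)   # ReLU

print(dead_unit_fraction(activations))           # 2 of the 8 units never fire
```

In a histogram view, the same situation appears as a distribution collapsed into a single bar at zero for the affected tensor.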

---


### 2.4 Visualizing Data and Model Inputs

TensorBoard supports direct visualization of input data such as images and audio. This capability is particularly valuable for verifying data preprocessing and augmentation pipelines.

For example, when applying random rotations, crops, or color jitter to images, visual inspection ensures that augmentations preserve semantic meaning and do not introduce unintended artifacts.

Data visualization serves as a sanity check that helps prevent training on corrupted or mislabeled data.

---


### 2.5 Embedding Projector and Dimensionality Reduction

Modern neural networks frequently learn high-dimensional embeddings, such as word vectors or latent feature representations. TensorBoard’s **Embedding Projector** enables visualization of these embeddings by projecting them into two or three dimensions.

Common dimensionality reduction techniques include:
* **Principal Component Analysis (PCA):** Linear projection maximizing variance.
* **t-SNE:** Nonlinear projection preserving local neighborhood structure.

Although projections lose information, they provide qualitative insight into semantic relationships, such as word similarity or class clustering in latent space.
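To make the PCA option concrete, here is a minimal NumPy sketch of projecting embeddings onto their top principal components via the singular value decomposition. The function name `pca_project` and the random stand-in vectors are assumptions for illustration. The Embedding Projector performs its own projection, so this only shows the underlying idea.

```python
import numpy as np


def pca_project(embeddings, num_components=2):
    """Project the rows of `embeddings` onto their top principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:num_components].T


rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 50))   # stand-in for 100 embeddings of dimension 50
projected = pca_project(vectors, num_components=2)
print(projected.shape)  # (100, 2)
```

Because the directions are ordered by explained variance, the first output column always varies at least as much as the second. That ordering is what "linear projection maximizing variance" refers to.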

---


### 2.6 Performance Profiling

Training deep learning models involves a complex interaction between data loading, CPU execution, and accelerator (GPU/TPU) computation. Performance bottlenecks may arise at different stages of the pipeline.

TensorBoard’s **Profiler** decomposes execution time into components such as:
* Input pipeline latency
* Kernel launch overhead
* Device computation time

By analyzing these breakdowns, practitioners can identify whether training is limited by data throughput, model complexity, or hardware utilization, enabling targeted optimization.
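The idea of splitting time between the input pipeline and computation can be approximated by hand with wall-clock timers. The sketch below is a simplification under stated assumptions: `time_breakdown` and `slow_loader` are hypothetical helpers, and the "training step" is a trivial function. On accelerators, work is often launched asynchronously, so timing a Python call this way can be misleading. That is one reason to rely on the Profiler for real measurements.

```python
import time


def time_breakdown(batches, train_step):
    """Split wall-clock time into time waiting for data and time spent in the step."""
    data_time = 0.0
    step_time = 0.0
    iterator = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)   # time spent here is input pipeline latency
        except StopIteration:
            break
        loaded = time.perf_counter()
        train_step(batch)            # time spent here is compute
        done = time.perf_counter()
        data_time += loaded - start
        step_time += done - loaded
    return data_time, step_time


def slow_loader(num_batches, delay=0.01):
    """Hypothetical loader that simulates slow disk reads or preprocessing."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i


data_time, step_time = time_breakdown(slow_loader(5), train_step=lambda batch: batch * 2)
print(f"input: {data_time:.3f}s, compute: {step_time:.3f}s")
```

When the input share dominates, as it does in this contrived case, the pipeline is the bottleneck rather than the model.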

---


## 3️⃣ Practical Importance of TensorBoard

TensorBoard serves as both a debugging tool and an experimental analysis platform. It enables reproducibility by preserving detailed training logs and supports systematic comparison between experimental runs.

In large-scale deep learning workflows, TensorBoard is essential for:
* Monitoring long-running experiments
* Diagnosing training instabilities
* Communicating model behavior to collaborators

As model complexity grows, visualization and profiling tools become indispensable components of the deep learning development lifecycle.

---


## Setup and Imports

We need to load the TensorBoard extension to view it inside the notebook.

```python
%load_ext tensorboard

import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow_datasets as tfds
import datetime
import os
import numpy as np

# Ensure the log directory exists (this does not delete logs from earlier runs)
os.makedirs('logs', exist_ok=True)
```

## 4️⃣ Section 1: Visualizing Data with TensorBoard

Before training, we should inspect our data. We will load the **Fashion MNIST** dataset and log a batch of images to TensorBoard.

### 4.1 Data Loading Pipeline

```python
# Load Fashion MNIST
dataset, info = tfds.load('fashion_mnist', with_info=True, as_supervised=True)
train_ds = dataset['train']
test_ds = dataset['test']

# Map Class IDs to Names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

def normalize_img(image, label):
    # Scale pixel values from [0, 255] to [0, 1]
    return tf.cast(image, tf.float32) / 255.0, label

train_ds = train_ds.map(normalize_img).shuffle(1000).batch(32)
test_ds = test_ds.map(normalize_img).batch(32)
```

### 4.2 Logging Images
We use `tf.summary.image` to write image data to the logs.

```python
# Create a log directory with timestamp
current_time = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
log_dir = os.path.join('logs', 'image_viz', current_time)
file_writer = tf.summary.create_file_writer(log_dir)

# Get a single batch of images
images, labels = next(iter(train_ds))

# tf.summary.image expects shape (Batch, Height, Width, Channels).
# Fashion MNIST batches are already (32, 28, 28, 1), so no reshape is needed.

with file_writer.as_default():
    # Log the first 5 images
    # step=0 indicates this is the initial state
    tf.summary.image("Training data", images, max_outputs=5, step=0)

print(f"Images logged to {log_dir}")
```

## 5️⃣ Section 2: Monitoring Model Training

We will build a simple CNN and use `tf.keras.callbacks.TensorBoard` to automatically log metrics (Loss, Accuracy) and weights (Histograms).

**Key Argument:** `histogram_freq=1` tells Keras to compute histograms of weights every epoch.

```python
def create_model():
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    return model

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Define TensorBoard Callback
log_dir = os.path.join("logs", "fit", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    histogram_freq=1  # Log weight histograms every epoch
)

# Train
model.fit(train_ds,
          epochs=3,
          validation_data=test_ds,
          callbacks=[tensorboard_callback])
```

### 5.1 Viewing TensorBoard
To view the dashboard, you would typically run the following command in a cell. 

**Note:** In some environments (like standard Jupyter), this opens an interactive window. In others, you might need to run `tensorboard --logdir logs` from your terminal.

```python
%tensorboard --logdir logs
```

## 6️⃣ Section 3: Custom Logging with `tf.summary`

Sometimes the Keras callback isn't enough. Inside a custom training loop, you might want to log metrics that the callback does not record for you, such as the mean value of the gradients or the current learning rate.

Here, we simulate a custom loop and log the **mean weight** of the first layer manually.

```python
# Define a separate writer for custom metrics
custom_log_dir = os.path.join("logs", "custom", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
summary_writer = tf.summary.create_file_writer(custom_log_dir)

model = create_model()
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

epochs = 3
for epoch in range(epochs):
    print(f"\nStart of epoch {epoch}")

    for step, (x_batch_train, y_batch_train) in enumerate(train_ds):
        with tf.GradientTape() as tape:
            # The model ends in a softmax layer, so it outputs probabilities (not logits),
            # which matches the default from_logits=False of the loss.
            predictions = model(x_batch_train, training=True)
            loss_value = loss_fn(y_batch_train, predictions)

        grads = tape.gradient(loss_value, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # --- Custom Logging ---
        # Log every 200 steps
        if step % 200 == 0:
            with summary_writer.as_default():
                # 1. Log scalar Loss
                tf.summary.scalar('custom_loss', loss_value, step=optimizer.iterations)

                # 2. Log mean weight of first layer
                # (A coarse check for weights drifting or blowing up)
                w = model.layers[0].weights[0]
                mean_w = tf.reduce_mean(w)
                tf.summary.scalar('weight_mean_l0', mean_w, step=optimizer.iterations)

    print(f"Epoch {epoch} done.")
```

## 7️⃣ Section 4: Profiling Performance

The TensorBoard **Profiler** helps identify if your input pipeline is slow (CPU bound) or if your model operations are slow (GPU bound).

To use it, we simply add the `profile_batch` argument to the callback. It defines which batches to profile (e.g., batches 50 to 60).

*Note: Profiling often requires specific GPU drivers and the CUPTI library installed on the host machine.*

```python
log_dir = os.path.join("logs", "profile", datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))

tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=log_dir,
    # Profile batches 50 to 60. A tuple is used here; the string form '50,60'
    # is accepted by some TensorFlow 2.x releases but may not be by newer Keras.
    profile_batch=(50, 60)
)

# We would then fit the model as usual:
# model.fit(train_ds, epochs=1, callbacks=[tensorboard_callback])
print("Profiler configured. Check the 'Profile' tab in TensorBoard after running fit.")
```

## 8️⃣ Section 5: Visualizing Embeddings (Projector)

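The Embedding Projector allows us to verify whether a model has learned semantic relationships between words. In practice you would load pretrained vectors such as **GloVe**; to keep this demonstration self-contained, we simulate them with random embeddings, so no meaningful clusters should be expected in the result.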

**Logic:**
1. Save the weights of the embedding layer to a checkpoint file.
2. Save the vocabulary (metadata) to a TSV file.
3. Configure a `projector_config.pbtxt` linking the two.

```python
from tensorboard.plugins import projector

# 1. Create dummy embeddings (Simulating GloVe for demonstration)
vocab_size = 1000
embedding_dim = 50
dummy_weights = tf.Variable(tf.random.normal([vocab_size, embedding_dim]))
dummy_vocab = [f"word_{i}" for i in range(vocab_size)]

# 2. Setup Log Directory
log_dir = os.path.join('logs', 'embeddings')
os.makedirs(log_dir, exist_ok=True)

# 3. Save Weights (Checkpoint)
checkpoint = tf.train.Checkpoint(embedding=dummy_weights)
checkpoint.save(os.path.join(log_dir, "embedding.ckpt"))

# 4. Save Metadata (TSV): one label per line, in the same order as the embedding rows
with open(os.path.join(log_dir, 'metadata.tsv'), 'w', encoding='utf-8') as f:
    for word in dummy_vocab:
        f.write(f"{word}\n")

# 5. Configure Projector
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# The name under which the checkpoint stores the variable saved as `embedding`
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'

# Writes projector_config.pbtxt into log_dir
projector.visualize_embeddings(log_dir, config)

print(f"Embeddings ready. Run TensorBoard pointing to {log_dir} and check 'Projector' tab.")
```
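If you do have real GloVe vectors on disk, they can replace the dummy embeddings. The sketch below parses the GloVe plain-text layout, in which each line holds a token followed by its space-separated vector components. The function name `parse_glove_text` and the tiny in-memory sample are assumptions for illustration. The sample values are made up and are not real GloVe vectors.

```python
import io


def parse_glove_text(lines, max_words=None):
    """Parse GloVe-style text: each line is a token followed by its vector components."""
    vocab, vectors = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vocab.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
        if max_words is not None and len(vocab) >= max_words:
            break
    return vocab, vectors


# Tiny made-up sample in the same layout (illustrative values only).
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.1 0.25 0.3\napple -0.4 0.0 0.9\n")
vocab, vectors = parse_glove_text(sample)

# One token per line, matching the single-column metadata.tsv written above.
metadata = "".join(f"{word}\n" for word in vocab)
print(vocab, len(vectors[0]))
```

With a real file, you would pass an open file handle (for example, a local copy of `glove.6B.50d.txt`, a path assumed here) instead of the `StringIO` sample, then wrap `vectors` in a `tf.Variable` in place of `dummy_weights`.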

## 9️⃣ Chapter Summary

* **TensorBoard** is indispensable for debugging deep learning models.
* **Scalars Tab:** Use it to track Overfitting (when Validation Loss diverges from Training Loss).
* **Images Tab:** Use it to sanity check your data pipeline inputs.
* **Histograms Tab:** Use it to monitor weight health (look for smooth, roughly bell-shaped distributions; a sharp spike at a single value such as 0 can indicate dead units or vanishing updates).
* **Profile Tab:** Use it to identify if you need to optimize your `tf.data` pipeline (prefetching/caching) or your model ops.
* **Projector Tab:** Use it to visualize high-dimensional embeddings in 2D or 3D using PCA or t-SNE.