<div style="background-color: black; color: white; padding: 10px;text-align: center;"> 
  <strong>Date Published:</strong> November 4, 2025  &nbsp; | &nbsp; <strong>Author:</strong> Adnan Alaref
</div> 

**1. What Is a Sanity Check?**  
> A sanity check is a minimal set of controlled experiments and inspections to verify that each
>  pipeline component (data → model → loss → gradients → learning loop) behaves as expected, before spending time training a full model.

Think of it like "pre-flight checks" before takeoff: you make sure nothing is fundamentally wrong.

# 🧩 Sanity Checks You Should Always Run

Here's a checklist of essential sanity checks in deep learning. Use it as your **gold standard** for verifying training behavior and debugging issues.

| # | Check | What It Verifies | Expected Result |
|:-:|:------|:------------------|:----------------|
| 1 | **Shape Check** | Inputs, outputs, and labels align | No shape mismatch errors |
| 2 | **Forward Pass Check** | Model runs without errors | Reasonable logits (not NaN / inf) |
| 3 | **Loss Check** | Loss decreases over iterations | Loss gets smaller over time |
| 4 | **Overfit on One Batch** | Network can memorize small data | Accuracy → 100% |
| 5 | **Gradient Flow Check** | Gradients aren't zero or exploding | Mean(‖grad‖) ∈ [1e-4, 1e-1] roughly |
| 6 | **Weight Update Check** | Parameters change after optimizer step | `(param_old - param_new).abs().mean() > 0` |
| 7 | **Learning Rate Check** | LR schedule behaves as expected | Observed LR == expected |
| 8 | **BatchNorm / Dropout Off at Eval** | Eval mode disables randomness | `model.eval()` → deterministic output |
| 9 | **Loss–Accuracy Correlation** | Lower loss improves accuracy | Loss ↓ → Accuracy ↑ |
| 10 | **Gradient Clipping Sanity** | Clip thresholds applied properly | No exploding grads |

---

✅ **Tip:** Run these checks *before* long training runs. They can save hours (or days) of debugging.
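Check 8 from the table is not exercised by the CNN used later in this notebook, which has no BatchNorm or Dropout layers. The sketch below shows one way to test it, assuming a hypothetical toy network (`toy_net`) built only for illustration: in train mode two forward passes on the same input should differ because of dropout, while in eval mode they should match exactly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy network containing the two layer types that behave
# differently in train and eval mode.
toy_net = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(16, 4),
)
x = torch.randn(32, 8)

# Train mode: dropout samples a new mask on every call.
toy_net.train()
with torch.no_grad():
    train_a, train_b = toy_net(x), toy_net(x)
train_outputs_differ = not torch.allclose(train_a, train_b)

# Eval mode: dropout is disabled and BatchNorm uses its running statistics.
toy_net.eval()
with torch.no_grad():
    eval_a, eval_b = toy_net(x), toy_net(x)
eval_outputs_match = torch.equal(eval_a, eval_b)

print(f"train outputs differ: {train_outputs_differ}")
print(f"eval outputs match:   {eval_outputs_match}")
```

If the eval outputs do not match, a likely cause is a forgotten `model.eval()` call before validation or inference.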


# **🧱 Step 1: Import Libraries**

In [1]:
import torch
import torch.nn as nn
from tqdm.auto import tqdm
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# **🧱 Step 2: Set Up a Toy Dataset**

In [2]:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5), # mean for each channel
                         (0.5,0.5,0.5)) # std for each channel
])

data = datasets.FakeData(size=200,
                         image_size=(3,32,32),
                         num_classes=10,
                         transform = transform
                        )
train_dataloader = DataLoader(dataset=data, batch_size = 16, shuffle=True)

# **⚙️ Step 3: Define a Simple CNN**

In [3]:
class simplecnn(nn.Module):
  def __init__(self, num_classes=10) -> None:
    super().__init__()
    self.features = nn.Sequential(
      nn.Conv2d(3, 32, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.Conv2d(32, 64, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.MaxPool2d(2),   # -> 64 x 16 x 16
      nn.Conv2d(64, 128, kernel_size=3, padding=1),
      nn.ReLU(),
      nn.AdaptiveAvgPool2d(1)  # -> 128 x 1 x 1
    )
    self.classifier = nn.Linear(128,num_classes)

  def forward(self, x:torch.Tensor)->torch.Tensor:
    x = self.features(x)
    x = x.view(x.size(0),-1)
    x = self.classifier(x)
    return x

In [4]:
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else  "cpu"
device

'cpu'

In [5]:
model = simplecnn().to(device)
criterion = nn.CrossEntropyLoss()
# optimzer = optim.SGD(model.parameters(), lr=0.1)
optimzer = optim.AdamW(model.parameters(), lr = 0.001, weight_decay=0.0)# weight_decay=0.0 for overfit test
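The optimizer above uses a fixed learning rate, so the checklist's learning rate check has nothing to verify yet. If a scheduler is added later, the check can be sketched as follows. This is a minimal illustration that assumes a `StepLR` schedule; `base_lr`, `step_size`, and `gamma` are made-up values, not settings from this notebook.

```python
import torch
import torch.nn as nn

# Hypothetical single parameter, just so the optimizer has something to hold.
param = nn.Parameter(torch.zeros(1))
base_lr, step_size, gamma = 0.1, 3, 0.5

optimizer = torch.optim.SGD([param], lr=base_lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)

observed, expected = [], []
for epoch in range(9):
    observed.append(scheduler.get_last_lr()[0])
    # StepLR multiplies the LR by gamma once every step_size epochs.
    expected.append(base_lr * gamma ** (epoch // step_size))
    optimizer.step()   # no gradients here, so nothing changes; keeps the expected call order
    scheduler.step()

for epoch, (obs, exp) in enumerate(zip(observed, expected)):
    print(f"epoch {epoch}: observed lr={obs:.5f}, expected lr={exp:.5f}")
    assert abs(obs - exp) < 1e-12
```

Logging the observed rate next to a hand-computed expectation tends to catch off-by-one errors in warmup or decay boundaries.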

In [6]:
from torchsummary import summary
summary(model, input_size=(3,32,32))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 32, 32, 32]             896
              ReLU-2           [-1, 32, 32, 32]               0
            Conv2d-3           [-1, 64, 32, 32]          18,496
              ReLU-4           [-1, 64, 32, 32]               0
         MaxPool2d-5           [-1, 64, 16, 16]               0
            Conv2d-6          [-1, 128, 16, 16]          73,856
              ReLU-7          [-1, 128, 16, 16]               0
 AdaptiveAvgPool2d-8            [-1, 128, 1, 1]               0
            Linear-9                   [-1, 10]           1,290
Total params: 94,538
Trainable params: 94,538
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 2.13
Params size (MB): 0.36
Estimated Total Size (MB): 2.50
---------------------------------------------

# **🔍 Step 4: Data and Pipeline Integrity Checks**

## **A. Shape and Type Assertion:**  
Assert the shape and data type immediately before the data enters the model.

In [7]:
BATCH_SIZE, N_CHANNELS = 16, 3
batch_x, batch_y = next(iter(train_dataloader))

# Assert input shape (Batch, Channels, H, W) and type (float32)
assert batch_x.shape[0] == BATCH_SIZE
assert batch_x.shape[1] == N_CHANNELS
assert batch_x.dtype  == torch.float32

# Assert label shape ((Batch,) for classification) and type (int64)
assert batch_y.shape[0] == BATCH_SIZE
assert batch_y.ndim == 1 or batch_y.shape[-1] == 1

# Sanity: ensure labels dtype and range
assert batch_y.dtype == torch.long , "Labels must be torch.long for CrossEntropyLoss"
assert batch_y.max().item() < 10 and batch_y.min().item() >= 0, "Labels out of expected range"

## **B. Normalization/Scaling Check:**
Verify mean and std are close to expected values (0 and 1 for standardization, or 0.5 for simple min-max scaling).

In [8]:
# Check input normalization (after scaling/transforms)
mean = batch_x.mean().item()
std = batch_x.std().item()

# For standardization, check if mean is near 0 and std near 1
print(f"Input Mean: {mean:.4f}, Input Std: {std:.4f}")

# Assertion for transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)).
# For uniformly distributed data in [0, 1], the theoretical std is 1/sqrt(12) ~ 0.289.
# The transform maps the range to [-1, 1], which doubles the std to 2/sqrt(12) ~ 0.577.
EXPECTED_STD = 0.58  # theoretical value is ~0.577; 0.58 is slightly looser
assert abs(mean) < 0.1  # mean should be close to 0
assert abs(std - EXPECTED_STD) < 0.01  # std should be close to EXPECTED_STD (not 1)

Input Mean: 0.0022, Input Std: 0.5815
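The expected standard deviation used in the assertion can be confirmed numerically. The sketch below assumes pixel values that are uniform on [0, 1], which is the assumption behind the 0.577 figure; real image data will generally have a different spread.

```python
import math
import torch

torch.manual_seed(0)

# Assumption: pixel intensities are uniform on [0, 1).
pixels = torch.rand(1_000_000)

# Same arithmetic as transforms.Normalize(mean=0.5, std=0.5).
normalized = (pixels - 0.5) / 0.5

theoretical_std = 2 / math.sqrt(12)   # std of a uniform variable on [-1, 1]
empirical_std = normalized.std().item()
empirical_mean = normalized.mean().item()

print(f"theoretical std: {theoretical_std:.4f}")
print(f"empirical std:   {empirical_std:.4f}")
print(f"empirical mean:  {empirical_mean:.4f}")
```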


# **🧪 Step 5: Initialization and Loss Sanity (The Zero-Step Check)**

## **A. Fixed Seeding for Reproducibility:** Must be the first thing you do.

In [9]:
import random
import numpy as np

def set_all_seeds(seed):
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)
  if torch.cuda.is_available():
      torch.cuda.manual_seed_all(seed)
      torch.backends.cudnn.deterministic = True
      torch.backends.cudnn.benchmark = False

set_all_seeds(42) # Set a fixed random seed
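A quick way to confirm that seeding works is to build the same layer twice from the same seed and compare the weights. The helper `build_layer` below is a hypothetical name introduced for this sketch only.

```python
import torch
import torch.nn as nn


def build_layer(seed: int) -> nn.Linear:
    """Seed the global RNG, then create a freshly initialized layer."""
    torch.manual_seed(seed)
    return nn.Linear(4, 3)


layer_a = build_layer(42)
layer_b = build_layer(42)
layer_c = build_layer(7)

same_seed_matches = torch.equal(layer_a.weight, layer_b.weight)
different_seed_differs = not torch.equal(layer_a.weight, layer_c.weight)

print(f"same seed gives identical weights:      {same_seed_matches}")
print(f"different seed gives different weights: {different_seed_differs}")
```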

## **B. Expected Initial Loss Value:**
Calculate and assert the loss on the first batch with randomly initialized weights.
### 🎯 Loss Function Sanity Check Guide

| **Loss Function**             | **Expected Initial Loss (Random P)** | **Code Check (Example: C = 10 Classes)** |
|-------------------------------|--------------------------------------|------------------------------------------|
| **Cross-Entropy (Classification)** | $-\log(1/C)$ | `expected_loss = np.log(10)`  → **≈ 2.30** |
| **Binary Cross-Entropy (BCE)** | $-\log(0.5)$ | `expected_loss = np.log(2)`  → **≈ 0.69** |
| **Mean Squared Error (MSE)** | Approx. $\text{Var}(\text{Targets})$, assuming initial predictions sit near the target mean | Check variance of your target `y` |


In [10]:
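N_CLASSES = train_dataloader.dataset.num_classes

model_v = simplecnn().to(device)  # fresh, untrained model
criterion = nn.CrossEntropyLoss()
with torch.no_grad():  # a loss probe needs no gradients
  # Move the batch to the model's device before the forward pass
  initial_loss = criterion(model_v(batch_x.to(device)), batch_y.to(device)).item()

EXPECTED_INITIAL_LOSS = np.log(N_CLASSES)  # -log(1/C) for uniform predictions

print(f"Initial Loss: {initial_loss:.4f}, Expected: {EXPECTED_INITIAL_LOSS:.4f}")
assert abs(initial_loss - EXPECTED_INITIAL_LOSS) < 0.1 # Should be very close!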

Initial Loss: 2.3006, Expected: 2.3026
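The reference values in the table follow directly from feeding perfectly uniform predictions into the loss. The sketch below uses all-zero logits as a stand-in for an untrained model (an idealization: a real random initialization only gets close to this).

```python
import math
import torch
import torch.nn.functional as F

num_classes, batch_size = 10, 16

# All-zero logits give a uniform softmax, so every class gets probability 1/C.
uniform_logits = torch.zeros(batch_size, num_classes)
targets = torch.arange(batch_size) % num_classes
ce_at_uniform = F.cross_entropy(uniform_logits, targets).item()

# A zero logit gives sigmoid(0) = 0.5, the binary analogue of a uniform guess.
binary_targets = (torch.arange(batch_size) % 2).float()
bce_at_half = F.binary_cross_entropy_with_logits(
    torch.zeros(batch_size), binary_targets
).item()

print(f"cross-entropy: {ce_at_uniform:.4f}  vs log(10) = {math.log(num_classes):.4f}")
print(f"BCE:           {bce_at_half:.4f}  vs log(2)  = {math.log(2):.4f}")
```

An initial loss far above these values often points to badly scaled inputs or initialization; a value far below them can hint at label leakage.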


# **🧩 Step 6: Forward-Pass Sanity Check**

In [11]:
X, y = next(iter(train_dataloader))
X, y = X.to(device), y.to(device)

out = model(X)
print("Output shape:", out.shape)
print("Label shape:", y.shape)

Output shape: torch.Size([16, 10])
Label shape: torch.Size([16])
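Printing shapes covers only part of the forward-pass check in the checklist, which also asks for finite logits. A minimal sketch of the fuller check follows; `tiny_model` is a hypothetical stand-in classifier, not the CNN defined above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in classifier for 3x32x32 inputs and 10 classes.
tiny_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
images = torch.randn(16, 3, 32, 32)

with torch.no_grad():
    logits = tiny_model(images)
probs = logits.softmax(dim=1)

# Shape: one row of class scores per input image.
assert logits.shape == (16, 10)
# Values: no NaN or Inf anywhere in the output.
assert torch.isfinite(logits).all()
# Probabilities: each softmax row sums to 1.
assert torch.allclose(probs.sum(dim=1), torch.ones(16), atol=1e-5)

print("Forward pass OK:", tuple(logits.shape))
```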


# **💪 Step 7: Overfitting Sanity Check (The "One-Batch" Test)**
>The most critical test: Prove your model can learn perfectly before trying to generalize.

## **A. Isolate a Tiny Set:**
Use a single, isolated batch or a very small, fixed subset of the training data.

In [12]:
# Create a tiny, fixed subset data loader
tiny_indices = torch.randperm(len(data))[:100]
tiny_dataset = torch.utils.data.Subset(dataset=data, indices=tiny_indices)
tiny_loader = torch.utils.data.DataLoader(dataset=tiny_dataset, batch_size=32, shuffle=True)

In [13]:
y_batch = next(iter(tiny_loader))[1]
print(torch.unique(y_batch))

tensor([0, 1, 2, 3, 4, 5, 6, 8, 9])


In [14]:
print(y_batch.min().item(), y_batch.max().item())

0 9


## **B. Train to 100% Accuracy:**
Train for many iterations on one fixed batch drawn from this tiny set. A slightly higher learning rate can speed this up; the cell below reuses the AdamW optimizer defined in Step 3.

In [15]:
epochs = 200

X_batch, y_batch = next(iter(tiny_loader))
X_batch, y_batch = X_batch.to(device), y_batch.to(device)

for epoch in tqdm(range(epochs), desc="Training Model..!"):
  model.train()
  out = model(X_batch)
  loss = criterion(out, y_batch)

  optimzer.zero_grad()
  loss.backward()
  optimzer.step()

  # Track accuracy/loss at the end of the epochs
  pred = out.argmax(1)
  acc = (pred == y_batch).float().mean().item() *100
  if (epoch+1) % 10 == 0 or epoch == 0:
    print(f"Epoch [{epoch+1:3d}] | Loss: {loss.item():.4f} | Acc: {acc:.2f}%")

Epoch [  1] | Loss: 2.2890 | Acc: 18.75%
Epoch [ 10] | Loss: 2.2016 | Acc: 15.62%
Epoch [ 20] | Loss: 2.1948 | Acc: 15.62%
Epoch [ 30] | Loss: 2.1811 | Acc: 28.12%
Epoch [ 40] | Loss: 2.1483 | Acc: 21.88%
Epoch [ 50] | Loss: 2.0605 | Acc: 25.00%
Epoch [ 60] | Loss: 1.9186 | Acc: 25.00%
Epoch [ 70] | Loss: 1.7403 | Acc: 40.62%
Epoch [ 80] | Loss: 1.5583 | Acc: 50.00%
Epoch [ 90] | Loss: 1.3950 | Acc: 53.12%
Epoch [100] | Loss: 1.2314 | Acc: 68.75%
Epoch [110] | Loss: 1.0812 | Acc: 71.88%
Epoch [120] | Loss: 0.9435 | Acc: 78.12%
Epoch [130] | Loss: 0.8319 | Acc: 81.25%
Epoch [140] | Loss: 0.7210 | Acc: 87.50%
Epoch [150] | Loss: 0.6155 | Acc: 87.50%
Epoch [160] | Loss: 0.5276 | Acc: 93.75%
Epoch [170] | Loss: 0.4646 | Acc: 93.75%
Epoch [180] | Loss: 0.3768 | Acc: 100.00%
Epoch [190] | Loss: 0.3227 | Acc: 100.00%
Epoch [200] | Loss: 0.2650 | Acc: 100.00%
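The falling loss above already implies that parameters are being updated, but the weight update check from the checklist makes this explicit and pinpoints frozen layers. A minimal sketch, using a hypothetical small layer `net` rather than the CNN above:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small model and random data, for illustration only.
net = nn.Linear(8, 3)
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
inputs, targets = torch.randn(16, 8), torch.randint(0, 3, (16,))

# Snapshot every parameter before the update.
before = {name: p.detach().clone() for name, p in net.named_parameters()}

loss = nn.functional.cross_entropy(net(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Mean absolute change per parameter tensor; 0 would mean "not updated".
mean_abs_change = {
    name: (before[name] - p.detach()).abs().mean().item()
    for name, p in net.named_parameters()
}
for name, change in mean_abs_change.items():
    status = "OK" if change > 0 else "!"
    print(f"[{status}] {name}: mean |delta| = {change:.6f}")
```

A parameter that reports zero change is usually either missing from the optimizer, has `requires_grad=False`, or is detached from the loss.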


# **🔬 Step 8: Gradient Sanity Check**
>Ensure gradients exist, are finite, and not vanishing

In [16]:
for name, params in model.named_parameters():
  if params.grad is None:
    print(f"[!] No grad for {name}")
  elif not torch.isfinite(params.grad).all():  # catches both NaN and Inf
    print(f"[!] Non-finite grad in {name}")
  else:
    # The signed mean can cancel to ~0 even when individual gradients are large
    print(f"[OK] {name} grad mean={params.grad.mean().item():.6f}")

[OK] features.0.weight grad mean=-0.000129
[OK] features.0.bias grad mean=-0.009620
[OK] features.2.weight grad mean=-0.000112
[OK] features.2.bias grad mean=-0.000454
[OK] features.5.weight grad mean=0.000025
[OK] features.5.bias grad mean=0.000045
[OK] classifier.weight grad mean=0.000000
[OK] classifier.bias grad mean=0.000000
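The classifier gradients print a mean of 0.000000, which does not by itself mean they vanish: with softmax cross-entropy, the gradient with respect to the logits sums to zero across classes, so the signed mean of the final layer's gradient cancels out. The gradient norm is a more telling statistic. The sketch below illustrates this on a hypothetical linear head and also exercises the gradient clipping check from the checklist; `max_norm = 0.01` is an arbitrary illustrative threshold.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical classification head and random data, for illustration only.
head = nn.Linear(8, 5)
features, labels = torch.randn(16, 8), torch.randint(0, 5, (16,))

loss = nn.functional.cross_entropy(head(features), labels)
loss.backward()

# Signed mean cancels across classes; the norm does not.
bias_grad_mean = head.bias.grad.mean().item()
bias_grad_norm = head.bias.grad.norm().item()
print(f"bias grad mean = {bias_grad_mean:.8f}")
print(f"bias grad norm = {bias_grad_norm:.6f}")

# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
max_norm = 0.01
norm_before = torch.nn.utils.clip_grad_norm_(head.parameters(), max_norm=max_norm).item()
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in head.parameters())).item()
print(f"total norm before clipping = {norm_before:.6f}")
print(f"total norm after clipping  = {norm_after:.6f}")
```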


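# **🧮 Step 9: Label Shuffle Test (Leakage Check)**
A: Expected: the model fails to learn (accuracy ≈ 10% for 10 classes).  
B: If accuracy increases, your data pipeline might be leaking information.  
Judge this on held-out data or within a short training budget: given enough steps, a network can memorize even random labels on a single small batch.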

In [17]:
# --- Step 1: Get one batch of data (tiny subset for the test)
X_batch, y_batch = next(iter(train_dataloader))
X_batch, y_batch = X_batch.to(device), y_batch.to(device)

# --- Step 2: Shuffle the labels (breaks the true mapping)
y_shuffled = y_batch[torch.randperm(len(y_batch))]

# --- Step 3: Reinitialize model + optimizer
model1 = simplecnn(num_classes=10).to(device)
optimizer = torch.optim.SGD(model1.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# --- Step 4: Train for a few epochs on the shuffled labels
epochs = 50
for epoch in tqdm(range(epochs), desc="🔍 Label Shuffle Test (Leakage Check)"):
    model1.train()
    optimizer.zero_grad()

    # Forward pass through model1, the model this optimizer updates
    out = model1(X_batch)
    loss = criterion(out, y_shuffled)
    loss.backward()
    optimizer.step()

    # Compute accuracy
    pred = out.argmax(1)
    acc = (pred == y_shuffled).float().mean().item() * 100

    if (epoch + 1) % 10 == 0 or epoch == 0:
        print(f"Epoch [{epoch+1:2d}] | Loss: {loss.item():.4f} | Acc: {acc:.2f}%")

# --- Step 5: Interpret
print("\n✅ Expected result: Loss stays high (~2.0) and accuracy ~10–20% (random guessing).")

A loss that is identical at every step, as in the log below, is itself a red flag: it means the optimizer is not updating the model that produced the output. This log was recorded with the forward pass calling `model` while the optimizer held the parameters of `model1`.

Epoch [ 1] | Loss: 3.2162 | Acc: 18.75%
Epoch [10] | Loss: 3.2162 | Acc: 18.75%
Epoch [20] | Loss: 3.2162 | Acc: 18.75%
Epoch [30] | Loss: 3.2162 | Acc: 18.75%
Epoch [40] | Loss: 3.2162 | Acc: 18.75%
Epoch [50] | Loss: 3.2162 | Acc: 18.75%

✅ Expected result: Loss stays high (~2.0) and accuracy ~10–20% (random guessing).
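Because a network can memorize a small training batch regardless of its labels, the leakage question is easiest to answer on data the model never trained on. The sketch below is one way to set that up, under stated assumptions: the inputs are synthetic random vectors, the labels are drawn independently of the inputs (which is what shuffling achieves), and `probe` is a hypothetical small MLP rather than the CNN above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
num_classes = 10

# Synthetic inputs with labels drawn independently of them.
x_train, x_val = torch.randn(256, 20), torch.randn(1000, 20)
y_train = torch.randint(0, num_classes, (256,))
y_val = torch.randint(0, num_classes, (1000,))

# Hypothetical probe model, optimizer, and loss.
probe = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, num_classes))
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

probe.train()
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(probe(x_train), y_train)
    loss.backward()
    optimizer.step()

probe.eval()
with torch.no_grad():
    train_acc = (probe(x_train).argmax(1) == y_train).float().mean().item()
    val_acc = (probe(x_val).argmax(1) == y_val).float().mean().item()

print(f"train acc: {train_acc:.2%} (may rise above chance through memorization)")
print(f"val acc:   {val_acc:.2%} (should stay near chance, about 10%)")
```

Held-out accuracy clearly above chance with shuffled labels would suggest that label information is reaching the inputs somewhere in the pipeline.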


# **🧱 Step 10: Activation Range Check (Optional but Valuable)**

In [18]:
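with torch.no_grad():
  x_sample = X_batch[:1]
  # Note: each Conv2d here receives the previous Conv2d's raw output, skipping the
  # ReLU and MaxPool2d layers in between, so these statistics are a rough probe
  # rather than the activations of a real forward pass.
  for name, layer in model.named_modules():
    if isinstance(layer, nn.Conv2d):
      x_sample = layer(x_sample)
      print(f"{name} activation mean={x_sample.mean().item():.4f}, std={x_sample.std().item():.4f}")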

features.0 activation mean=0.0701, std=0.5524
features.2 activation mean=0.0011, std=0.8876
features.5 activation mean=-0.9036, std=5.3058


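**🧠 Why This Matters**   
Activation statistics help catch dead layers or bad initialization early:

* If std → 0, the layer is dead (no signal flows).
* If the mean drifts far from 0 (|mean| > 0.5), activations may explode or saturate.
* As a rough guide, mean ≈ 0 and std ≈ 0.1–0.4 indicate healthy signal propagation. In the output above, `features.0` and `features.2` have near-zero means with std below 1, while `features.5` (mean -0.90, std 5.31) falls well outside that range and deserves a closer look.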
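Forward hooks are one way to collect these statistics from a genuine forward pass, so that every convolution sees the input it would see during training (after the preceding ReLU and pooling layers). The sketch below assumes a hypothetical stand-in feature extractor with the same Conv → ReLU → Pool pattern; the names `make_hook` and `activation_stats` are introduced only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in feature extractor.
features = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

activation_stats = {}


def make_hook(name):
    """Return a hook that records mean and std of the layer's output."""
    def hook(module, inputs, output):
        activation_stats[name] = (output.mean().item(), output.std().item())
    return hook


# Attach a hook to every Conv2d, run one real forward pass, then clean up.
handles = [
    layer.register_forward_hook(make_hook(name))
    for name, layer in features.named_modules()
    if isinstance(layer, nn.Conv2d)
]
with torch.no_grad():
    features(torch.randn(1, 3, 32, 32))
for handle in handles:
    handle.remove()

for name, (mean, std) in activation_stats.items():
    print(f"layer {name}: activation mean={mean:.4f}, std={std:.4f}")
```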

# 🧠 Model Sanity Validation Report

| ✅ Check | 🔍 What It Verifies | ⚙️ Expected Behavior | 🧾 Your Result | 🟩 Status |
|-----------|---------------------|----------------------|----------------|------------|
| **1. Forward Pass Check** | Model runs end-to-end without runtime errors | No NaNs or shape mismatch | ✅ Model ran correctly | ✅ PASS |
| **2. Shape Check** | Inputs/outputs match (`[B,3,32,32] → [B,10]`) | Shapes consistent | ✅ Verified with `(16,3,32,32) → (16,10)` | ✅ PASS |
| **3. Initial Loss Sanity** | Random model output loss ≈ `log(C)` | For 10 classes → ~2.30 | ✅ 2.30 observed | ✅ PASS |
| **4. Gradient Flow Check** | Ensure grads are finite and non-zero | Small non-zero grad means | ✅ `~1e-4` in conv layers | ✅ PASS |
| **5. Tiny Batch Overfit Test** | Model can memorize 1 batch (32 samples) | Loss ↓ → ~0, Acc → 100% | ✅ Loss ↓ 2.29 → 0.27, Acc → 100% | ✅ PASS |
| **6. Label Shuffle Test (Leakage Check)** | No learning when labels are randomized | Loss stays high (~2.0), Acc ~10–20% | ⚠️ Loss constant at 3.22, Acc 18.75% | ⚠️ RE-RUN |
| **7. Gradient Stability** | Gradients not NaN/Inf | All grads finite | ✅ No NaN/Inf found | ✅ PASS |
| **8. Numerical Stability** | Loss doesn't explode | Loss steady, finite | ✅ Stable throughout training | ✅ PASS |

---

### ✅ Summary
The core sanity checks **passed**, with one caveat:
- The model and optimizer are configured correctly  
- The training pipeline behaves as expected  
- The recorded label shuffle log shows a loss that never changes (the forward pass in that run went through `model`, not the freshly initialized `model1`), so re-run that test before ruling out data leakage  

🎯 Once the label shuffle test is confirmed, you're ready to scale to **full dataset training** or next-level experiments (e.g., LR warmup, batch norm tuning, data augmentations, etc.).


<a id="Import"></a>
<p style="background-color: #000000; font-family: 'Verdana', sans-serif; color: #FFFFFF; font-size: 160%; text-align: center; border-radius: 25px; padding: 12px 20px; margin-top: 20px; border: 2px solid transparent; background-image: linear-gradient(black, black), linear-gradient(45deg, #FF00FF, #00FFFF, #FFFF00, #FF4500); background-origin: border-box; background-clip: content-box, border-box; box-shadow: 0px 4px 20px rgba(255, 105, 180, 0.8);">
   Thanks & Upvote ❤️</p>