# 🖼️ Convolutional Neural Networks (CNNs) & Classic Architectures

## **From Image Recognition to Semiconductor Defect Detection**

---

### **📚 Learning Objectives**

By the end of this notebook, you will:

1. ✅ **Understand convolutions:** How convolution operations extract spatial features from images
2. ✅ **Master CNN building blocks:** Convolutions, pooling, normalization, activation functions
3. ✅ **Implement from scratch:** Build convolution operation using NumPy (educational)
4. ✅ **Build CNNs in PyTorch & Keras:** Practical implementations with modern frameworks
5. ✅ **Study classic architectures:** LeNet, AlexNet, VGG, Inception, ResNet (evolution of CNNs)
6. ✅ **Apply to semiconductor testing:** Wafer map defect pattern classification (20+ defect types)
7. ✅ **Transfer learning:** Use pre-trained models (ImageNet) for semiconductor applications
8. ✅ **Optimize for production:** Model compression, inference speed, deployment strategies

---

## **Why Convolutional Neural Networks?**

### **The Image Recognition Revolution**

**Before CNNs (pre-2012):**
- Hand-crafted features (SIFT, HOG, Haar cascades)
- Shallow models (SVM, Random Forest on engineered features)
- **ImageNet top-5 accuracy:** ~75% (2011 winner)

**After CNNs (2012-present):**
- Automatic feature learning from data
- Deep hierarchical representations
- **ImageNet top-5 accuracy:** ~96% (ResNet, 2015), comparable to reported human-level performance

**Breakthrough moment:** AlexNet (2012) won ImageNet with 84.7% top-5 accuracy (vs 73.8% for second place) using CNNs trained on GPUs.

---

### **Why CNNs Work for Images**

**Traditional fully-connected networks fail for images:**
- **Too many parameters:** A 224×224×3 image has ~150K input values → 150K×1000 = 150M weights for a first layer with 1,000 units!
- **No spatial structure:** Treats pixels as independent features (ignores proximity)
- **Not translation invariant:** Cat in top-left ≠ cat in bottom-right (must learn separately)

**CNNs solve these problems:**
- ✅ **Parameter sharing:** Same filter applied to entire image (on the order of 1K parameters vs 150M)
- ✅ **Spatial locality:** Convolutions preserve 2D structure (nearby pixels processed together)
- ✅ **Translation invariance:** Same filter detects features anywhere in image (strictly, convolution is translation-equivariant; pooling adds approximate invariance to small shifts)
- ✅ **Hierarchical features:** Low-level edges → mid-level textures → high-level objects
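
To make the parameter-sharing argument concrete, the arithmetic can be sketched as below. The layer sizes (a 1,000-unit dense layer and 32 filters of size 3×3) are assumptions chosen for illustration, not values fixed by this notebook.

```python
# Assumed layer sizes for illustration: a 1,000-unit dense layer versus
# 32 convolutional filters of size 3x3 over 3 input channels.
height, width, channels = 224, 224, 3
dense_units = 1000

# Fully connected: every input value connects to every unit.
fc_weights = height * width * channels * dense_units

# Convolution: each filter has k*k*channels weights plus one bias,
# and those weights are shared across all spatial positions.
num_filters, k = 32, 3
conv_params = num_filters * (k * k * channels + 1)

print(f"Fully connected weights: {fc_weights:,}")
print(f"Conv layer parameters:   {conv_params:,}")
print(f"Ratio: {fc_weights / conv_params:,.0f}x")
```

The convolutional layer's parameter count does not depend on the image size at all, which is why the same filter bank can be applied to 28×28 digits or 300×300 wafer maps.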

---

### **Semiconductor Post-Silicon Validation Use Case**

**Challenge:** Classify defect patterns on semiconductor wafer maps.

**Wafer map:** 2D spatial representation of die pass/fail status on a wafer (300mm diameter, 10K-50K die).

**Defect patterns indicate root causes:**
- **Ring pattern:** Chamber conditioning issue (plasma uniformity)
- **Edge failures:** Edge exclusion problem (contamination)
- **Cluster defects:** Particle contamination (specific wafer location)
- **Scratch pattern:** Wafer handling damage (linear defect)
- **Radial pattern:** Process gradient (center-to-edge variation)
- **Random failures:** Random defects (no spatial correlation)

**Business value:** $5M-$20M per incident through **faster root cause analysis** (reduce time from days/weeks to hours).

**Why CNNs excel:**
- Spatial patterns (not just individual die failures)
- Translation invariance (defect can appear anywhere on wafer)
- Rotation invariance (with data augmentation)
- Pre-trained models (transfer learning from ImageNet)

---

### **CNN Applications Beyond Image Recognition**

| Domain | Application | Input | Output |
|--------|-------------|-------|--------|
| **Computer Vision** | Object detection | Image | Bounding boxes + classes |
| **Computer Vision** | Semantic segmentation | Image | Pixel-wise labels |
| **Computer Vision** | Face recognition | Face image | Identity |
| **Medical Imaging** | Tumor detection | CT/MRI scan | Tumor location + type |
| **Autonomous Driving** | Scene understanding | Camera feed | Objects, lanes, signs |
| **Semiconductor** | Wafer defect classification | Wafer map (2D) | Defect type (20+ classes) |
| **Semiconductor** | Die image inspection | SEM/optical images | Defect detection |
| **Time-Series** | 1D CNNs for signals | Sensor data | Anomaly detection |
| **NLP** | Text classification | Word embeddings | Sentiment/category |

---

## **What We'll Build**

### **1. Educational: Convolution from Scratch (NumPy)**

Implement convolution operation to understand the math:
```
Input: 28×28 grayscale image
Filter: 3×3 edge detector
Output: 26×26 feature map
```

### **2. Production: Wafer Map Defect Classifier (PyTorch + Keras)**

**Architecture (Custom CNN):**
```
Input(300×300×1) → Conv2D(32, 5×5, stride=1) + ReLU + MaxPool(2×2)
                 → Conv2D(64, 3×3, stride=1) + ReLU + MaxPool(2×2)
                 → Conv2D(128, 3×3, stride=1) + ReLU + MaxPool(2×2)
                 → Flatten
                 → Dense(256) + Dropout(0.5)
                 → Output(20, Softmax)  # 20 defect classes
```
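
As a rough sanity check on this architecture, the spatial size after each layer can be traced by hand. The sketch below assumes "same"-style padding on the convolutions (padding 2 for the 5×5 kernel, 1 for the 3×3 kernels), which is what the `WaferCNN` class later in this notebook uses; the architecture listing itself does not state the padding.

```python
def conv_out(n, k, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# (name, kernel, padding, stride); the padding values are an assumption.
layers = [
    ("conv 5x5", 5, 2, 1), ("pool 2x2", 2, 0, 2),
    ("conv 3x3", 3, 1, 1), ("pool 2x2", 2, 0, 2),
    ("conv 3x3", 3, 1, 1), ("pool 2x2", 2, 0, 2),
]

size = 300
for name, k, p, s in layers:
    size = conv_out(size, k, p, s)
    print(f"{name}: {size}x{size}")

flat_features = 128 * size * size      # 128 channels after the last conv
dense_weights = flat_features * 256    # weights into Dense(256), biases excluded
print(f"Flatten: {flat_features:,} features")
print(f"Dense(256) weights: {dense_weights:,}")
```

Under these assumptions almost all of the model's weights sit in the first dense layer, which is one reason to downsample the input or use global average pooling when model size matters.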

**Dataset:** 10K wafer maps (300×300 pixels), 20 defect classes.

**Metrics:** Top-1 accuracy ≥95%, top-3 accuracy ≥98%, inference <100ms.

---

### **3. Transfer Learning: ResNet-50 Fine-Tuned on Wafer Maps**

**Why transfer learning?**
- Pre-trained on ImageNet (1.2M images, 1000 classes)
- Low-level features (edges, textures) transfer to wafer maps
- Requires 10× less data (1K samples vs 10K+ from scratch)
- Faster convergence (5 epochs vs 50+ from scratch)

**Approach:**
1. Load pre-trained ResNet-50 (ImageNet weights)
2. Replace final layer (1000 classes → 20 defect classes)
3. Freeze early layers (keep learned features)
4. Fine-tune last few layers on wafer map data
5. Optionally unfreeze all layers for full fine-tuning

**Expected results:** 98%+ accuracy with 1K training samples (vs 95% from scratch with 10K samples).

---

## **Notebook Roadmap**

### **Part 1: Convolution Fundamentals** (Cells 2-3)
- Mathematical definition of convolution
- From-scratch implementation (NumPy)
- Convolution properties (padding, stride, dilation)
- Pooling operations (max, average, global)

### **Part 2: CNN Building Blocks** (Cell 4)
- Conv2D, BatchNorm, ReLU, MaxPool, Dropout
- Building custom CNNs in PyTorch and Keras
- Training on synthetic wafer maps

### **Part 3: Classic CNN Architectures** (Cell 5)
- **LeNet-5 (1998):** First successful CNN (MNIST)
- **AlexNet (2012):** ImageNet breakthrough (ReLU, dropout, GPU)
- **VGG (2014):** Simplicity of 3×3 filters (depth matters)
- **Inception/GoogLeNet (2014):** Multi-scale features (1×1, 3×3, 5×5 in parallel)
- **ResNet (2015):** Skip connections (train 152+ layers)

### **Part 4: Transfer Learning & Fine-Tuning** (Cell 6)
- Load pre-trained ResNet-50 from ImageNet
- Fine-tune on wafer map dataset
- Compare with training from scratch
- Feature extraction vs full fine-tuning

### **Part 5: Production Deployment** (Cell 7)
- Model optimization (pruning, quantization, ONNX)
- Real-time inference (<100ms per wafer)
- Deployment strategies (TorchServe, TF Serving)
- Explainability (Grad-CAM visualization)

### **Part 6: Real-World Projects** (Cell 8)
- 8 comprehensive projects (4 semiconductor + 4 general AI/ML)
- Business value, architecture, success metrics
- Key takeaways and best practices

---

## **Architecture Evolution Timeline**

```mermaid
graph LR
    A[LeNet-5<br/>1998<br/>~60K params] --> B[AlexNet<br/>2012<br/>60M params<br/>ImageNet winner]
    B --> C[VGG-16<br/>2014<br/>138M params<br/>Simple, deep]
    B --> D[Inception v1<br/>2014<br/>6M params<br/>Multi-scale]
    C --> E[ResNet-50<br/>2015<br/>25M params<br/>Skip connections]
    D --> E
    E --> F[EfficientNet<br/>2019<br/>5M params<br/>NAS + scaling]
    E --> G[Vision Transformer<br/>2020<br/>86M params<br/>Attention-based]
    
    style B fill:#ff9999
    style E fill:#99ff99
    style G fill:#9999ff
```

**Key milestones:**
- **1998:** LeNet-5 (handwritten digits, ~99% MNIST)
- **2012:** AlexNet (ImageNet breakthrough, 84.7% top-5 accuracy)
- **2014:** VGG (simplicity wins), Inception (multi-scale features)
- **2015:** ResNet (**revolution**, 152 layers, ~96% top-5 ImageNet accuracy)
- **2017:** DenseNet (connect everything), MobileNet (mobile efficiency)
- **2019:** EfficientNet (compound scaling, SOTA efficiency)
- **2020:** Vision Transformer (attention replaces convolutions)

---

## **Prerequisites**

**Required:**
- ✅ Notebook 051: Neural Networks Foundations (backpropagation, gradients)
- ✅ Notebook 052: Deep Learning Frameworks (PyTorch, Keras)
- ✅ Linear algebra basics (matrix multiplication, convolution as matrix operation)
- ✅ Image processing basics (pixels, channels, RGB vs grayscale)

**Helpful:**
- Basic understanding of image filters (blur, edge detection)
- Familiarity with image datasets (MNIST, CIFAR-10, ImageNet)
- Experience with data augmentation (rotation, flip, zoom)

---

## **Installation & Setup**

```bash
# Core libraries (already installed in Notebook 052)
pip install torch torchvision tensorflow

# Additional for CNNs
pip install opencv-python  # Image processing
pip install albumentations  # Advanced data augmentation
pip install timm  # PyTorch Image Models (pre-trained CNNs)
pip install grad-cam  # Explainability (visualize what CNN learned)

# For wafer map generation
pip install scikit-image  # Spatial pattern generation
```

**Check installation:**
```python
import torch
import torchvision
import tensorflow as tf
import cv2
print(f"PyTorch: {torch.__version__}")
print(f"TorchVision: {torchvision.__version__}")
print(f"TensorFlow: {tf.__version__}")
print(f"OpenCV: {cv2.__version__}")
```

---

## **Dataset Overview**

### **Synthetic Wafer Map Dataset (For This Notebook)**

We'll generate synthetic wafer maps with realistic defect patterns:

**Patterns:**
1. **None (normal):** Random fail rate 0.5-2%
2. **Center:** High failure in center (Gaussian distribution)
3. **Edge:** High failure at wafer edge (ring pattern)
4. **Scratch:** Linear defect (horizontal, vertical, or diagonal)
5. **Ring:** Circular ring of failures (process uniformity issue)
6. **Near-full:** Almost all die fail (catastrophic failure)
7. **Donut:** Ring with good center (edge exclusion + center good)
8. **Cluster:** Localized cluster of failures (particle contamination)
9. **Random:** Completely random failures (no spatial pattern)
10. **Radial:** Radial gradient (center-to-edge variation)
... (20 total classes)

**Dataset split:**
- Training: 7,000 wafer maps (350 per class)
- Validation: 1,500 wafer maps (75 per class)
- Test: 1,500 wafer maps (75 per class)

---

## **Success Metrics**

**Model Performance:**
- **Top-1 accuracy:** ≥95% (correct class)
- **Top-3 accuracy:** ≥98% (correct class in top 3 predictions)
- **Precision/Recall:** ‚â•90% per class (balanced performance)
- **Confusion matrix:** Identify commonly confused patterns
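
For reference, top-k accuracy can be computed from a matrix of class scores as sketched below. The function name `top_k_accuracy` and the score values are made up for illustration; they are not outputs of any model in this notebook.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]       # indices of the k largest scores
    hits = (top_k == labels[:, None]).any(axis=1)    # is the true label among them?
    return hits.mean()

# Made-up scores for 4 samples and 5 classes (illustrative only).
scores = np.array([
    [0.05, 0.60, 0.15, 0.12, 0.08],
    [0.30, 0.05, 0.45, 0.12, 0.08],
    [0.25, 0.20, 0.05, 0.40, 0.10],
    [0.04, 0.06, 0.10, 0.30, 0.50],
])
labels = np.array([1, 0, 2, 4])

print("Top-1:", top_k_accuracy(scores, labels, k=1))
print("Top-3:", top_k_accuracy(scores, labels, k=3))
```

Top-3 accuracy is always at least as high as top-1, which is why the two targets are set at different levels.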

**Production Requirements:**
- **Inference time:** <100ms per wafer (CPU), <10ms (GPU)
- **Model size:** <50MB (for edge deployment)
- **Explainability:** Grad-CAM visualization of decision regions

**Business Impact:**
- **Root cause time:** Reduce from days/weeks → hours
- **Cost savings:** $5M-$20M per incident through faster resolution
- **Accuracy:** 95%+ reduces false alarms (minimize wasted engineering time)

---

## **Learning Path Context**

**Where we are:**
- ✅ **Notebook 051:** Neural networks from scratch (MLPs, backpropagation)
- ✅ **Notebook 052:** Deep learning frameworks (PyTorch, Keras, ONNX)
- 🔥 **Notebook 053:** Convolutional networks (spatial features, images)

**Where we're going:**
- 📘 **Notebook 054:** Recurrent networks (RNNs, LSTMs for sequences)
- 📘 **Notebook 055:** Transformers and attention (modern NLP and vision)
- 📘 **Notebook 056:** Generative models (GANs, VAEs, diffusion)
- 📘 **Notebook 057:** Reinforcement learning (agents, policies)

---

## **Time Investment**

- **Reading + code execution:** 4-5 hours
- **Practice exercises:** 2-3 hours
- **Real-world project:** 8-12 hours (wafer map classifier)
- **Total:** 14-20 hours for mastery

**Recommended:** Spread over 4-6 sessions, practice with your own image data between sessions.

---

## ✅ **Learning Objectives Checklist**

- [ ] Understand how convolution operations work mathematically
- [ ] Implement convolution from scratch using NumPy
- [ ] Build CNNs in PyTorch and TensorFlow/Keras
- [ ] Explain the architecture evolution (LeNet → AlexNet → VGG → Inception → ResNet)
- [ ] Apply transfer learning to new domains (ImageNet → wafer maps)
- [ ] Optimize CNNs for production (quantization, pruning, ONNX)
- [ ] Deploy CNN models with <100ms inference time
- [ ] Visualize what CNNs learn using Grad-CAM

---

**Let's dive into the world of Convolutional Neural Networks!** 🚀

## 🧮 Part 1: Convolution Mathematics & Implementation

### **What is Convolution?**

**Intuition:** Slide a small filter (kernel) over an image, computing dot products at each position.

**Mathematical definition:**

For 2D convolution (image processing):

$$
(I * K)(x, y) = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} I(x-i, y-j) \cdot K(i, j)
$$

Where:
- $I$ = input image (e.g., 28×28 pixels)
- $K$ = convolution kernel/filter (e.g., 3×3)
- $*$ = convolution operation
- $(x, y)$ = output position

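**Discrete form used in CNNs (finite kernel, no padding):**

$$
(I \star K)(x, y) = \sum_{i=0}^{k_h-1} \sum_{j=0}^{k_w-1} I(x+i, y+j) \cdot K(i, j)
$$

Where $k_h, k_w$ are kernel height and width. Note the change from $I(x-i, y-j)$ to $I(x+i, y+j)$: this form does not flip the kernel, so it is strictly *cross-correlation* (written $\star$). Deep learning frameworks implement this version and still call it convolution; because the kernel weights are learned, the flip makes no practical difference.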

---

### **Example: 3×3 Edge Detection Filter**

**Input image $I$ (5×5):**
```
[[10, 10, 10,  0,  0],
 [10, 10, 10,  0,  0],
 [10, 10, 10,  0,  0],
 [10, 10, 10,  0,  0],
 [10, 10, 10,  0,  0]]
```

**Sobel filter $K$ (3×3) - vertical edge detector:**
```
[[-1,  0,  1],
 [-2,  0,  2],
 [-1,  0,  1]]
```

**Convolution with the 3×3 window centered at input position (1,1):**
```
Result = 10×(-1) + 10×0 + 10×1
       + 10×(-2) + 10×0 + 10×2
       + 10×(-1) + 10×0 + 10×1
       = -10 + 0 + 10 - 20 + 0 + 20 - 10 + 0 + 10
       = 0  (flat region, no edge)
```

**Window centered at position (1,2) - on the edge:**
```
Result = 10×(-1) + 10×0 + 0×1
       + 10×(-2) + 10×0 + 0×2
       + 10×(-1) + 10×0 + 0×1
       = -10 + 0 + 0 - 20 + 0 + 0 - 10 + 0 + 0
       = -40  (strong vertical edge detected!)
```

**Output:** Large-magnitude values where vertical edges exist; the negative sign here reflects a bright-to-dark transition from left to right.
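
The two hand computations can be checked numerically with a few lines of NumPy. The helper `window_response` is a hypothetical name introduced for this sketch; the last line also shows what a true convolution (kernel flipped in both axes) would give at the edge position.

```python
import numpy as np

image = np.array([[10, 10, 10, 0, 0]] * 5, dtype=np.float32)
sobel = np.array([[-1, 0, 1],
                  [-2, 0, 2],
                  [-1, 0, 1]], dtype=np.float32)

def window_response(img, kernel, row, col):
    """Sum of elementwise products for the 3x3 window centered at (row, col)."""
    patch = img[row - 1:row + 2, col - 1:col + 2]
    return float(np.sum(patch * kernel))

flat = window_response(image, sobel, 1, 1)   # window over the flat region
edge = window_response(image, sobel, 1, 2)   # window straddling the edge

# True convolution flips the kernel in both axes before sliding it.
edge_flipped = window_response(image, sobel[::-1, ::-1], 1, 2)

print(flat, edge, edge_flipped)
```

Flipping the kernel only changes the sign of the response here, which is why the distinction rarely matters once the kernel weights are learned.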

---

### **Convolution Properties**

#### **1. Padding**

**Problem:** Convolution shrinks output size.
- Input: 5×5, Filter: 3×3 → Output: 3×3 (lost 1 pixel on each side)

**Solution:** Add zeros around input (padding).

**Types:**
- **Valid (no padding):** Output size = $(n - k + 1) \times (n - k + 1)$
- **Same (zero padding):** Output size = $n \times n$ (same as input)
- **Full padding:** Pad with $k-1$ zeros on each side

**Formula:**
$$
\text{Output size} = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1
$$
Where:
- $n$ = input size
- $p$ = padding
- $k$ = kernel size
- $s$ = stride
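
A small helper makes the output-size formula easy to check against the padding modes and stride examples in this section. The function name `conv_output_size` is an assumption for this sketch; integer division implements the floor.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

valid_5 = conv_output_size(5, 3)                # valid padding on a 5x5 input
valid_28 = conv_output_size(28, 3)              # valid padding on a 28x28 input
same_28 = conv_output_size(28, 3, p=1)          # "same": p = (k - 1) / 2 for odd k
full_28 = conv_output_size(28, 3, p=2)          # "full": p = k - 1
strided_28 = conv_output_size(28, 3, p=0, s=2)  # stride 2, no padding

print(valid_5, valid_28, same_28, full_28, strided_28)
```

The stride-2 case is where the floor matters: (28 - 3) / 2 = 12.5, which rounds down to 12 before adding 1.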

---

#### **2. Stride**

**Definition:** Step size when sliding filter.

- **Stride = 1:** Slide 1 pixel at a time (default)
- **Stride = 2:** Slide 2 pixels at a time (roughly halves each output dimension)

**Effect:**
- Larger stride → smaller output → fewer activations → faster computation (the kernel's own parameter count is unchanged)
- Trade-off: May lose spatial information

**Example:**
- Input: 28×28, Filter: 3×3, Stride: 1, Padding: 0 → Output: 26×26
- Input: 28×28, Filter: 3×3, Stride: 2, Padding: 0 → Output: 13×13

---

#### **3. Dilation (Atrous Convolution)**

**Definition:** Insert gaps between kernel elements.

**Purpose:** Increase receptive field without increasing parameters.

**Dilation rate = 2:**
```
Original 3×3 kernel:
[a, b, c]
[d, e, f]
[g, h, i]

Dilated 3×3 kernel (effective 5×5):
[a, 0, b, 0, c]
[0, 0, 0, 0, 0]
[d, 0, e, 0, f]
[0, 0, 0, 0, 0]
[g, 0, h, 0, i]
```

**Use case:** Semantic segmentation (capture multi-scale context).
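
The zero-insertion picture can be reproduced directly in NumPy. This is only an illustration of the effective kernel: `dilate_kernel` is a hypothetical helper, and real frameworks skip the zeros rather than materializing them.

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between neighbouring kernel elements."""
    kh, kw = kernel.shape
    out = np.zeros((rate * (kh - 1) + 1, rate * (kw - 1) + 1), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel   # original weights land on every rate-th position
    return out

kernel = np.arange(1, 10).reshape(3, 3)   # stand-in for weights a..i
dilated = dilate_kernel(kernel, rate=2)

print(dilated)
print("Effective size:", dilated.shape, "non-zero weights:", np.count_nonzero(dilated))
```

The effective size grows from 3×3 to 5×5 while the number of learnable weights stays at 9.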

---

#### **4. Receptive Field**

**Definition:** Region in input that influences one output pixel.

**Example:**
- 1 conv layer (3×3): Receptive field = 3×3
- 2 conv layers (3×3): Receptive field = 5×5
- 3 conv layers (3×3): Receptive field = 7×7

**Formula (for $L$ layers):**
$$
RF = 1 + \sum_{i=1}^{L} (k_i - 1) \cdot \prod_{j=1}^{i-1} s_j
$$

**Why it matters:** Deeper networks see larger context (better for complex patterns).
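
The receptive-field formula translates into a short loop that tracks the cumulative stride. In this sketch the `wafer_cnn_layers` list is an assumed reading of the custom CNN from the overview (5×5 conv, then two 3×3 convs, each followed by 2×2 pooling with stride 2), with pooling layers counted like any other layer.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf, jump = 1, 1              # jump = product of strides of earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

three_convs = receptive_field([(3, 1)] * 3)

# Assumed layer list for the custom wafer CNN: conv/pool repeated three times.
wafer_cnn_layers = [(5, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
wafer_rf = receptive_field(wafer_cnn_layers)

print(three_convs, wafer_rf)
```

Because each pooling layer doubles the cumulative stride, later 3×3 convolutions contribute far more to the receptive field than early ones.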

---

### **Pooling Operations**

**Purpose:** Downsample feature maps (reduce spatial dimensions, increase efficiency).

#### **Max Pooling**

Take maximum value in each region.

**Example (2×2 max pooling, stride=2):**
```
Input (4×4):
[[1, 3, 2, 4],
 [5, 6, 1, 2],
 [7, 2, 8, 1],
 [0, 9, 3, 5]]

Output (2×2):
[[6, 4],   # max(1,3,5,6)=6, max(2,4,1,2)=4
 [9, 8]]   # max(7,2,0,9)=9, max(8,1,3,5)=8
```

**Advantages:**
- ✅ Approximate translation invariance (small shifts often leave the max unchanged)
- ✅ Computational efficiency (2×2 pooling with stride 2 reduces the number of activations by 4×)
- ✅ Mild regularization (lossy compression can reduce overfitting)

---

#### **Average Pooling**

Take average value in each region.

**Example (2×2 average pooling, stride=2):**
```
Input (4×4):
[[1, 3, 2, 4],
 [5, 6, 1, 2],
 [7, 2, 8, 1],
 [0, 9, 3, 5]]

Output (2×2):
[[3.75, 2.25],  # avg(1,3,5,6)=3.75, avg(2,4,1,2)=2.25
 [4.5,  4.25]]  # avg(7,2,0,9)=4.5,  avg(8,1,3,5)=4.25
```

**Use case:** Final feature aggregation (e.g., Global Average Pooling before classification).
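
For non-overlapping windows, average pooling can be written without loops by reshaping the map into blocks. A minimal sketch, assuming the stride equals the pool size; `avg_pool2d` is a name introduced here, not a function from the notebook.

```python
import numpy as np

def avg_pool2d(x, pool_size=2):
    """Non-overlapping average pooling (stride equal to pool_size)."""
    h, w = x.shape
    h_out, w_out = h // pool_size, w // pool_size
    x = x[:h_out * pool_size, :w_out * pool_size]            # drop any ragged border
    blocks = x.reshape(h_out, pool_size, w_out, pool_size)   # blocks[i, a, j, b] = x[2i+a, 2j+b]
    return blocks.mean(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 8, 1],
    [0, 9, 3, 5],
], dtype=np.float32)

pooled_avg = avg_pool2d(feature_map)
print(pooled_avg)
```

Replacing `mean` with `max` in the last line of the function gives the max-pooling result from the previous example.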

---

#### **Global Average Pooling (GAP)**

Average entire feature map into single value per channel.

**Example:**
- Input: 7×7×512 (feature map with 512 channels)
- Output: 1×1×512 (512 values)

**Advantages:**
- ✅ Eliminates fully connected layers (fewer parameters)
- ✅ Spatial invariance (works for any input size)
- ✅ Regularization effect (prevents overfitting)

**Used in:** ResNet, Inception, MobileNet
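
The parameter saving from global average pooling is easy to quantify. The sketch below uses random stand-in values for the feature map and assumes a 1,000-class linear classifier on top; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((7, 7, 512))     # (H, W, C) with random stand-in values

gap = feature_map.mean(axis=(0, 1))       # one average per channel
print("GAP output shape:", gap.shape)

num_classes = 1000                        # assumed classifier width
flatten_weights = 7 * 7 * 512 * num_classes   # Flatten followed by a dense layer
gap_weights = 512 * num_classes               # GAP followed by a dense layer
print(f"Flatten + Dense: {flatten_weights:,} weights")
print(f"GAP + Dense:     {gap_weights:,} weights")
```

Averaging over the spatial axes also makes the classifier independent of the feature map's height and width.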

---

### **From-Scratch Implementation (NumPy)**

Now let's implement 2D convolution to understand the mechanics:

```python
import numpy as np
import matplotlib.pyplot as plt

def convolve2d(image, kernel, padding=0, stride=1):
    """
    2D convolution implementation from scratch.

    Note: like deep learning frameworks, this computes cross-correlation
    (the kernel is not flipped).
    
    Args:
        image: 2D array (H, W)
        kernel: 2D array (kH, kW)
        padding: int (zero padding on all sides)
        stride: int (step size)
    
    Returns:
        output: 2D array (convolved feature map)
    """
    # Get dimensions
    H, W = image.shape
    kH, kW = kernel.shape
    
    # Add padding
    if padding > 0:
        image = np.pad(image, padding, mode='constant', constant_values=0)
        H, W = image.shape
    
    # Calculate output dimensions
    out_H = (H - kH) // stride + 1
    out_W = (W - kW) // stride + 1
    
    # Initialize output
    output = np.zeros((out_H, out_W))
    
    # Perform convolution
    for i in range(0, out_H):
        for j in range(0, out_W):
            # Extract region
            y_start = i * stride
            x_start = j * stride
            region = image[y_start:y_start+kH, x_start:x_start+kW]
            
            # Element-wise multiply and sum
            output[i, j] = np.sum(region * kernel)
    
    return output


def max_pool2d(image, pool_size=2, stride=2):
    """
    2D max pooling from scratch.
    
    Args:
        image: 2D array (H, W)
        pool_size: int (pooling window size)
        stride: int (step size)
    
    Returns:
        output: 2D array (pooled feature map)
    """
    H, W = image.shape
    
    # Calculate output dimensions
    out_H = (H - pool_size) // stride + 1
    out_W = (W - pool_size) // stride + 1
    
    # Initialize output
    output = np.zeros((out_H, out_W))
    
    # Perform max pooling
    for i in range(out_H):
        for j in range(out_W):
            y_start = i * stride
            x_start = j * stride
            region = image[y_start:y_start+pool_size, x_start:x_start+pool_size]
            output[i, j] = np.max(region)
    
    return output


# Example: Edge detection
print("="*80)
print("CONVOLUTION FROM SCRATCH - EDGE DETECTION")
print("="*80)

# Create simple test image (5×5)
image = np.array([
    [10, 10, 10,  0,  0],
    [10, 10, 10,  0,  0],
    [10, 10, 10,  0,  0],
    [10, 10, 10,  0,  0],
    [10, 10, 10,  0,  0]
], dtype=np.float32)

# Sobel filter (vertical edge detector)
sobel_vertical = np.array([
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1]
], dtype=np.float32)

# Apply convolution
output = convolve2d(image, sobel_vertical, padding=0, stride=1)

print(f"Input image shape: {image.shape}")
print(f"Kernel shape: {sobel_vertical.shape}")
print(f"Output shape: {output.shape}")
print(f"\nOutput (edge detected):\n{output}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(image, cmap='gray')
axes[0].set_title('Input Image')
axes[0].axis('off')

axes[1].imshow(sobel_vertical, cmap='gray')
axes[1].set_title('Sobel Filter (Vertical)')
axes[1].axis('off')

axes[2].imshow(output, cmap='gray')
axes[2].set_title('Convolution Output (Edge)')
axes[2].axis('off')

plt.tight_layout()
plt.show()

# Example: Max pooling
print("\n" + "="*80)
print("MAX POOLING FROM SCRATCH")
print("="*80)

# Create test feature map
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 8, 1],
    [0, 9, 3, 5]
], dtype=np.float32)

# Apply max pooling
pooled = max_pool2d(feature_map, pool_size=2, stride=2)

print(f"Input feature map:\n{feature_map}")
print(f"\nMax pooled (2×2, stride=2):\n{pooled}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].imshow(feature_map, cmap='viridis')
axes[0].set_title('Feature Map (4×4)')
axes[0].axis('off')
for i in range(4):
    for j in range(4):
        axes[0].text(j, i, f'{feature_map[i,j]:.0f}', 
                     ha='center', va='center', color='white', fontsize=14)

axes[1].imshow(pooled, cmap='viridis')
axes[1].set_title('Max Pooled (2×2)')
axes[1].axis('off')
for i in range(2):
    for j in range(2):
        axes[1].text(j, i, f'{pooled[i,j]:.0f}', 
                     ha='center', va='center', color='white', fontsize=16)

plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("From-scratch implementation complete!")
print("Now we'll use PyTorch/Keras for efficient GPU-accelerated convolutions.")
print("="*80)
```

---

### **Key Insights from Implementation**

1. **Convolution = Pattern Matching**
   - Filter weights define what pattern to detect
   - High output value = pattern found at that location
   - Different filters detect different features (edges, corners, textures)

2. **Computational Cost**
   - For each output pixel: $k_h \times k_w$ multiplications
   - Total operations: $(n-k+1)^2 \times k^2$ for $n \times n$ input, $k \times k$ filter
   - Example: 28×28 input, 3×3 filter = 26×26×9 = 6,084 multiplications per filter

3. **Why Use Frameworks (PyTorch/Keras)?**
   - ✅ GPU acceleration (often orders of magnitude faster than Python loops)
   - ✅ Optimized implementations (cuDNN, MKL)
   - ✅ Automatic gradient computation (backpropagation through convolutions)
   - ✅ Batch processing (multiple images at once)
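
Even within NumPy, much of the loop overhead can be removed by vectorizing. The sketch below is one possible approach, using `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) and random stand-in data; it also confirms the multiplication count from the cost example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_loop(image, kernel):
    """Reference implementation: explicit loops, no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def correlate2d_vectorized(image, kernel):
    """Same result, computed over all windows at once."""
    windows = sliding_window_view(image, kernel.shape)   # (out_h, out_w, kh, kw)
    return np.einsum("ijkl,kl->ij", windows, kernel)

rng = np.random.default_rng(0)
image = rng.random((28, 28))     # random stand-in for a 28x28 image
kernel = rng.random((3, 3))

loop_out = correlate2d_loop(image, kernel)
vec_out = correlate2d_vectorized(image, kernel)
mults = loop_out.size * kernel.size

print("Outputs match:", np.allclose(loop_out, vec_out))
print("Multiplications per filter:", mults)
```

Framework kernels go further than this (batching, multiple channels, hardware-specific algorithms), but the underlying arithmetic is the same.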

---

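**Next:** We'll build CNNs in PyTorch and Keras with GPU acceleration! 🚀

### 📝 Implementation

**Purpose:** Set up the synthetic wafer map generator: imports, random seeds, the circular wafer mask, and generators for the normal and center-defect classes.

**Key implementation details below.**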

In [None]:
"""
Synthetic Wafer Map Generator for CNN Training
This module generates realistic semiconductor wafer map defect patterns for training
defect classification models. Each wafer is 300×300 pixels representing die pass/fail.
Defect Types (20 classes):
    0: None (normal, random failures 0.5-2%)
    1: Center (failures concentrated in wafer center)
    2: Edge (failures at wafer edge)
    3: Scratch (linear defect: horizontal, vertical, or diagonal)
    4: Ring (circular ring pattern)
    5: Cluster (localized cluster of failures)
    ... (15 more classes)
Business value: $5M-$20M per incident through faster root cause analysis.
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, TensorDataset
import time
# Set random seeds
np.random.seed(42)
torch.manual_seed(42)
print("="*80)
print("WAFER MAP DEFECT PATTERN GENERATOR")
print("="*80)
#------------------------------------------------------------------------------
# 1. Wafer Map Generation Functions
#------------------------------------------------------------------------------
def generate_wafer_mask(size=300):
    """
    Generate circular wafer mask (wafer is circular, not square).
    
    Returns:
        mask: 2D array (size×size), 1=valid die, 0=outside wafer
    """
    center = size // 2
    y, x = np.ogrid[:size, :size]
    distance = np.sqrt((x - center)**2 + (y - center)**2)
    mask = (distance <= center * 0.95).astype(np.float32)  # 95% of radius
    return mask
def generate_normal_wafer(size=300, fail_rate=0.01):
    """Normal wafer: random failures (0.5-2%)"""
    wafer = np.random.rand(size, size) < fail_rate
    mask = generate_wafer_mask(size)
    return wafer.astype(np.float32) * mask
def generate_center_defect(size=300):
    """Center defect: high failure density in center"""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    distance = np.sqrt((x - center)**2 + (y - center)**2)
    
    # Gaussian peak at center
    defect = np.exp(-(distance**2) / (2 * (size/8)**2))
    wafer = (np.random.rand(size, size) < defect * 0.8).astype(np.float32)
    
    mask = generate_wafer_mask(size)
    return wafer * mask


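### 📝 Function: generate_edge_defect

**Purpose:** Define the remaining defect generators (edge, scratch, ring, cluster) and the lookup tables that map class indices to generators and names.

**Key implementation details below.**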

In [None]:
def generate_edge_defect(size=300):
    """Edge defect: high failure at wafer edge"""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    distance = np.sqrt((x - center)**2 + (y - center)**2)
    
    # Ring at edge
    edge_distance = np.abs(distance - center * 0.85)
    defect = np.exp(-(edge_distance**2) / (2 * (size/20)**2))
    wafer = (np.random.rand(size, size) < defect * 0.7).astype(np.float32)
    
    mask = generate_wafer_mask(size)
    return wafer * mask
def generate_scratch_defect(size=300):
    """Scratch: linear defect (horizontal, vertical, or diagonal)"""
    wafer = np.zeros((size, size), dtype=np.float32)
    
    # Random orientation
    orientation = np.random.choice(['horizontal', 'vertical', 'diagonal'])
    position = np.random.randint(size//4, 3*size//4)
    width = np.random.randint(3, 8)
    
    if orientation == 'horizontal':
        wafer[position-width:position+width, :] = 1
    elif orientation == 'vertical':
        wafer[:, position-width:position+width] = 1
    else:  # diagonal
        for i in range(size):
            j = i + position - size//2
            if 0 <= j < size:
                wafer[max(0, i-width):min(size, i+width), 
                      max(0, j-width):min(size, j+width)] = 1
    
    mask = generate_wafer_mask(size)
    return wafer * mask
def generate_ring_defect(size=300):
    """Ring defect: circular ring of failures"""
    center = size // 2
    y, x = np.ogrid[:size, :size]
    distance = np.sqrt((x - center)**2 + (y - center)**2)
    
    # Ring at random radius
    ring_radius = np.random.uniform(0.4, 0.7) * center
    ring_width = size / 15
    ring_distance = np.abs(distance - ring_radius)
    defect = np.exp(-(ring_distance**2) / (2 * ring_width**2))
    
    wafer = (np.random.rand(size, size) < defect * 0.8).astype(np.float32)
    mask = generate_wafer_mask(size)
    return wafer * mask
def generate_cluster_defect(size=300):
    """Cluster: localized cluster of failures"""
    wafer = np.zeros((size, size), dtype=np.float32)
    
    # Random cluster location
    n_clusters = np.random.randint(1, 4)
    for _ in range(n_clusters):
        cx = np.random.randint(size//4, 3*size//4)
        cy = np.random.randint(size//4, 3*size//4)
        cluster_size = np.random.uniform(size/15, size/8)
        
        y, x = np.ogrid[:size, :size]
        distance = np.sqrt((x - cx)**2 + (y - cy)**2)
        cluster = np.exp(-(distance**2) / (2 * cluster_size**2))
        wafer += (np.random.rand(size, size) < cluster * 0.9).astype(np.float32)
    
    wafer = np.clip(wafer, 0, 1)
    mask = generate_wafer_mask(size)
    return wafer * mask
# Dictionary mapping class index to generation function
DEFECT_GENERATORS = {
    0: lambda size: generate_normal_wafer(size, fail_rate=0.01),
    1: generate_center_defect,
    2: generate_edge_defect,
    3: generate_scratch_defect,
    4: generate_ring_defect,
    5: generate_cluster_defect,
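    # For simplicity, we'll use variations of above for other classes.
    # Note: classes 8 and 9 reuse the generators of classes 4 and 5 unchanged,
    # so each of those pairs is drawn from the same distribution.
    6: lambda size: generate_normal_wafer(size, fail_rate=0.02),
    # Clip the sum so the combined map stays in [0, 1] like the other classes
    7: lambda size: np.clip(generate_edge_defect(size) + generate_center_defect(size) * 0.3, 0, 1),
    8: lambda size: generate_ring_defect(size),
    9: lambda size: generate_cluster_defect(size),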
}
DEFECT_NAMES = {
    0: "Normal",
    1: "Center",
    2: "Edge",
    3: "Scratch",
    4: "Ring",
    5: "Cluster",
    6: "Normal-High",
    7: "Edge+Center",
    8: "Ring-2",
    9: "Multi-Cluster",
}
#------------------------------------------------------------------------------
# 2. Generate Dataset
#------------------------------------------------------------------------------


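### 📝 Function: generate_wafer_dataset

**Purpose:** Build the labeled dataset, split it into train/validation/test sets, visualize one sample per class, and prepare PyTorch tensors and DataLoaders.

**Key implementation details below.**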

In [None]:
def generate_wafer_dataset(n_samples=1000, n_classes=10, size=300):
    """
    Generate synthetic wafer map dataset.
    
    Args:
        n_samples: int (total samples)
        n_classes: int (number of defect classes)
        size: int (wafer map size, e.g., 300×300)
    
    Returns:
        X: np.array (n_samples, size, size, 1)
        y: np.array (n_samples,)
    """
    samples_per_class = n_samples // n_classes
    X = []
    y = []
    
    for class_idx in range(n_classes):
        generator = DEFECT_GENERATORS.get(class_idx, DEFECT_GENERATORS[0])
        
        for _ in range(samples_per_class):
            wafer = generator(size)
            X.append(wafer)
            y.append(class_idx)
    
    X = np.array(X)
    y = np.array(y)
    
    # Add channel dimension (grayscale)
    X = X[:, :, :, np.newaxis]
    
    return X, y
print("\nGenerating wafer map dataset...")
print("  - 7,000 training samples")
print("  - 1,500 validation samples")
print("  - 1,500 test samples")
print("  - 10 defect classes")
print("  - 128×128 pixel wafer maps (reduced from 300×300 for faster training)\n")
start_time = time.time()
# Generate dataset
X_all, y_all = generate_wafer_dataset(n_samples=10000, n_classes=10, size=128)  # Using 128 for faster training
# Split into train/val/test
X_train, X_temp, y_train, y_temp = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)
print(f"Dataset generated in {time.time() - start_time:.2f} seconds")
print(f"\nDataset shapes:")
print(f"  Training:   {X_train.shape}, labels: {y_train.shape}")
print(f"  Validation: {X_val.shape}, labels: {y_val.shape}")
print(f"  Test:       {X_test.shape}, labels: {y_test.shape}")
# Visualize samples from each class
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.ravel()
for class_idx in range(10):
    idx = np.where(y_train == class_idx)[0][0]
    axes[class_idx].imshow(X_train[idx, :, :, 0], cmap='RdYlGn_r', vmin=0, vmax=1)
    axes[class_idx].set_title(f'Class {class_idx}: {DEFECT_NAMES[class_idx]}')
    axes[class_idx].axis('off')
plt.tight_layout()
plt.suptitle('Wafer Map Defect Patterns (Sample from Each Class)', y=1.02, fontsize=14)
plt.savefig('wafer_map_samples.png', dpi=150, bbox_inches='tight')
print("\nSaved: wafer_map_samples.png")
plt.show()
#------------------------------------------------------------------------------
# 3. Build CNN in PyTorch
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("BUILDING CNN IN PYTORCH")
print("="*80)
# Device setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nDevice: {device}")
# Convert to PyTorch tensors
X_train_tensor = torch.FloatTensor(X_train).permute(0, 3, 1, 2).to(device)  # (N, C, H, W)
y_train_tensor = torch.LongTensor(y_train).to(device)
X_val_tensor = torch.FloatTensor(X_val).permute(0, 3, 1, 2).to(device)
y_val_tensor = torch.LongTensor(y_val).to(device)
X_test_tensor = torch.FloatTensor(X_test).permute(0, 3, 1, 2).to(device)
y_test_tensor = torch.LongTensor(y_test).to(device)
print(f"Tensor shapes (PyTorch format):")
print(f"  X_train: {X_train_tensor.shape} (N, C, H, W)")
print(f"  y_train: {y_train_tensor.shape}")
# Create DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataset = TensorDataset(X_val_tensor, y_val_tensor)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


### 📝 Class: WaferCNN

**Purpose:** Define the `WaferCNN` model, then set up the loss, optimizer, and training loop.

**Key implementation details below.**

In [None]:
class WaferCNN(nn.Module):
    """
    Custom CNN for wafer map defect classification.
    
    Architecture:
        Input(128×128×1) → Conv(32, 5×5) + ReLU + MaxPool(2×2)
                         → Conv(64, 3×3) + ReLU + MaxPool(2×2)
                         → Conv(128, 3×3) + ReLU + MaxPool(2×2)
                         → Flatten
                         → Dense(256) + Dropout(0.5)
                         → Output(10)
    """
    
    def __init__(self, num_classes=10):
        super(WaferCNN, self).__init__()
        
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, stride=1, padding=2)
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1)
        self.pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Calculate flattened size: 128×128 → 64×64 → 32×32 → 16×16
        self.flatten_size = 128 * 16 * 16
        
        # Fully connected layers
        self.fc1 = nn.Linear(self.flatten_size, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)
    
    def forward(self, x):
        # Conv block 1
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        
        # Conv block 2
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        
        # Conv block 3
        x = F.relu(self.conv3(x))
        x = self.pool3(x)
        
        # Flatten
        x = x.view(x.size(0), -1)
        
        # Fully connected
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        
        return x
# Create model
model = WaferCNN(num_classes=10).to(device)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nModel architecture:")
print(model)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)
print(f"\nLoss: CrossEntropyLoss")
print(f"Optimizer: Adam (lr=0.001)")
print(f"Scheduler: ReduceLROnPlateau")
#------------------------------------------------------------------------------
# 4. Training Loop
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("TRAINING CNN")
print("="*80)
num_epochs = 20
train_losses = []
val_losses = []
train_accs = []
val_accs = []
best_val_acc = 0.0
print(f"\nTraining for {num_epochs} epochs...")
start_time = time.time()
for epoch in range(num_epochs):
    # Training
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    
    for batch_X, batch_y in train_loader:
        # Forward pass
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Metrics
        train_loss += loss.item() * batch_X.size(0)
        _, predicted = torch.max(outputs.data, 1)
        train_total += batch_y.size(0)
        train_correct += (predicted == batch_y).sum().item()
    
    train_loss /= len(train_loader.dataset)
    train_acc = train_correct / train_total
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    
    # Validation
    model.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    
    with torch.no_grad():
        for batch_X, batch_y in val_loader:
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            
            val_loss += loss.item() * batch_X.size(0)
            _, predicted = torch.max(outputs.data, 1)
            val_total += batch_y.size(0)
            val_correct += (predicted == batch_y).sum().item()
    
    val_loss /= len(val_loader.dataset)
    val_acc = val_correct / val_total
    val_losses.append(val_loss)
    val_accs.append(val_acc)
    
    # Learning rate scheduling
    scheduler.step(val_loss)
    
    # Print progress
    if (epoch + 1) % 5 == 0 or epoch == 0:
        print(f"Epoch [{epoch+1:2d}/{num_epochs}] "
              f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")
    
    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        torch.save(model.state_dict(), 'best_wafer_cnn.pth')
training_time = time.time() - start_time
print(f"\nTraining completed in {training_time:.2f} seconds")
print(f"Best validation accuracy: {best_val_acc:.4f}")
# Load best model
model.load_state_dict(torch.load('best_wafer_cnn.pth'))
# Visualize training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(train_losses, label='Train Loss')
ax1.plot(val_losses, label='Val Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax2.plot(train_accs, label='Train Accuracy')
ax2.plot(val_accs, label='Val Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training and Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('cnn_training_curves.png', dpi=150, bbox_inches='tight')
print("Saved: cnn_training_curves.png")
plt.show()
print("\n" + "="*80)
print("CNN trained successfully!")
print("Next: Evaluate on test set and study classic architectures")
print("="*80)


## 🏛️ Part 2: Classic CNN Architectures

### **Evolution of CNN Architectures (1998-2020)**

---

### **1. LeNet-5 (1998) - The Pioneer**

**Authors:** Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner  
**Dataset:** MNIST (handwritten digits)  
**Accuracy:** ~99% on MNIST

**Architecture:**
```
Input(32×32×1) → Conv(6, 5×5) + tanh + AvgPool(2×2)
               → Conv(16, 5×5) + tanh + AvgPool(2×2)
               → Flatten
               → Dense(120) + tanh
               → Dense(84) + tanh
               → Output(10) + softmax
```

**Parameters:** ~60K

**Key innovations:**
- ✅ Convolutional layers for local feature extraction
- ✅ Pooling for translation invariance
- ✅ End-to-end trainable (no hand-crafted features)

**Limitations:**
- ❌ Shallow (only 2 conv layers)
- ❌ tanh activation (vanishing gradients)
- ❌ Small dataset (60K MNIST samples)
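
The layer list above maps almost one-to-one onto PyTorch modules. The following is a minimal sketch, assuming fully connected feature maps between the two conv layers and a plain linear output layer (the class name `LeNet5` and the variable names are illustrative); it also checks that the parameter count lands near the ~60K quoted above.

```python
import torch
import torch.nn as nn


class LeNet5(nn.Module):
    """LeNet-5-style network following the layer list above (32x32 grayscale input)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),       # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),   # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2),       # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),        # logits; softmax is folded into the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))


lenet = LeNet5()
lenet_params = sum(p.numel() for p in lenet.parameters())
lenet_logits = lenet(torch.zeros(2, 1, 32, 32))  # dummy batch of two images

print(f"Parameters: {lenet_params:,}")
print(f"Output shape: {tuple(lenet_logits.shape)}")
```

Almost all of the parameters sit in the first dense layer (400 × 120 weights), a pattern that repeats in AlexNet and VGG.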

---

### **2. AlexNet (2012) - The Revolution**

**Authors:** Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton  
**Dataset:** ImageNet (1.2M images, 1000 classes)  
**Accuracy:** 84.7% top-5 (vs 73.8% second place)

**Architecture:**
```
Input(224×224×3) → Conv(96, 11×11, stride=4) + ReLU + MaxPool(3×3, stride=2)
                 → Conv(256, 5×5) + ReLU + MaxPool(3×3, stride=2)
                 → Conv(384, 3×3) + ReLU
                 → Conv(384, 3×3) + ReLU
                 → Conv(256, 3×3) + ReLU + MaxPool(3×3, stride=2)
                 → Flatten
                 → Dense(4096) + ReLU + Dropout(0.5)
                 → Dense(4096) + ReLU + Dropout(0.5)
                 → Output(1000) + softmax
```

**Parameters:** ~60M
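
The spatial sizes through this stack follow from the usual formula `out = floor((in + 2*padding - kernel) / stride) + 1`. The sketch below traces them in plain Python. Two assumptions that are not stated above: a 227×227 input (with 224×224 and no padding the first layer does not divide evenly, so implementations commonly use 227 or pad the first layer), and paddings of 2, 1, 1, 1 for the second to fifth conv layers.

```python
def out_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output width/height of a conv or pooling layer on a square input."""
    return (in_size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding); the paddings are assumptions, see the text above
alexnet_layers = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace_sizes(in_size, layers):
    """Return [(layer_name, output_size), ...] for a sequence of layers."""
    sizes = []
    for name, kernel, stride, padding in layers:
        in_size = out_size(in_size, kernel, stride, padding)
        sizes.append((name, in_size))
    return sizes


alexnet_sizes = trace_sizes(227, alexnet_layers)
for name, size in alexnet_sizes:
    print(f"{name}: {size}x{size}")

# 256 channels after the last conv block feed the first dense layer
flatten_features = 256 * alexnet_sizes[-1][1] ** 2
print(f"Flattened features: {flatten_features}")
```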

**Key innovations:**
- ✅ **ReLU activation:** Faster training (the paper reports reaching 25% training error on CIFAR-10 about 6× faster than an equivalent tanh network)
- ✅ **Dropout:** Reduces overfitting (50% dropout in FC layers)
- ✅ **GPU training:** Trained on 2 GTX 580 GPUs (split model across GPUs)
- ✅ **Data augmentation:** Random crops, flips, color jittering
- ✅ **Local Response Normalization (LRN):** Lateral inhibition

**Impact:** Sparked the deep learning revolution. CNNs became dominant in computer vision.

---

### **3. VGG (2014) - Simplicity Wins**

**Authors:** Karen Simonyan, Andrew Zisserman (Oxford)  
**Dataset:** ImageNet  
**Accuracy:** 92.7% top-5 (VGG-16)

**Architecture (VGG-16):**
```
Input(224×224×3)
→ Conv(64, 3×3) + ReLU → Conv(64, 3×3) + ReLU → MaxPool(2×2)
→ Conv(128, 3×3) + ReLU → Conv(128, 3×3) + ReLU → MaxPool(2×2)
→ Conv(256, 3×3) + ReLU → Conv(256, 3×3) + ReLU → Conv(256, 3×3) + ReLU → MaxPool(2×2)
→ Conv(512, 3×3) + ReLU → Conv(512, 3×3) + ReLU → Conv(512, 3×3) + ReLU → MaxPool(2×2)
→ Conv(512, 3×3) + ReLU → Conv(512, 3×3) + ReLU → Conv(512, 3×3) + ReLU → MaxPool(2×2)
→ Flatten
→ Dense(4096) + ReLU + Dropout(0.5)
→ Dense(4096) + ReLU + Dropout(0.5)
→ Output(1000) + softmax
```

**Parameters:** 138M (VGG-16), 144M (VGG-19)

**Key insights:**
- ✅ **Uniform architecture:** Only 3×3 convolutions throughout (simple, easy to understand)
- ✅ **Depth matters:** Stacking 3×3 filters enlarges the receptive field (two stacked 3×3 convs cover the same 5×5 region as one 5×5 conv, with fewer parameters and an extra nonlinearity)
- ✅ **Transfer learning:** VGG features transfer well to other tasks
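
The "two 3×3 versus one 5×5" claim can be checked with a few lines of arithmetic. This is a small sketch in plain Python; the channel width of 64 is an illustrative choice and the helper names are hypothetical.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of one square 2D convolution."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


channels = 64  # illustrative width, same for input and output
two_3x3 = 2 * conv_params(3, channels, channels)
one_5x5 = conv_params(5, channels, channels)

print(f"Receptive field of two stacked 3x3 convs: {stacked_receptive_field([3, 3])}")
print(f"Parameters, two 3x3 convs: {two_3x3:,}")
print(f"Parameters, one 5x5 conv:  {one_5x5:,}")
```

Ignoring biases, the ratio is 18C² to 25C², so the stacked version needs 28% fewer weights at any channel width C.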

**Limitations:**
- ❌ **Too many parameters:** 138M parameters (slow training, large memory)
- ❌ **Computational cost:** ~16B FLOPs for a single forward pass

---

### **4. Inception v1 / GoogLeNet (2014) - Multi-Scale Features**

**Authors:** Christian Szegedy et al. (Google)  
**Dataset:** ImageNet  
**Accuracy:** 93.3% top-5

**Inception Module (core building block):**
```
Input feature map
    ↓
    ├→ Conv(1×1) → ReLU ────────────────────┐
    ├→ Conv(1×1) → ReLU → Conv(3×3) → ReLU ─┤
    ├→ Conv(1×1) → ReLU → Conv(5×5) → ReLU ─┤
    └→ MaxPool(3×3) → Conv(1×1) → ReLU ─────┘
                                            ↓
                                       Concatenate
                                            ↓
                                    Output feature map
```

**Key innovations:**
- ✅ **Multi-scale processing:** 1×1, 3×3, 5×5 convolutions in parallel (capture features at different scales)
- ✅ **1×1 convolutions:** Dimensionality reduction (reduce parameters and computation)
  - Example: 256 channels → 1×1 conv(64) → 3×3 conv → fewer parameters than direct 3×3 on 256 channels
- ✅ **Global Average Pooling:** Replace FC layers (fewer parameters, reduce overfitting)
- ✅ **Auxiliary classifiers:** Intermediate loss functions help gradient flow in deep networks
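
To make the module diagram concrete, here is a minimal PyTorch sketch of one Inception module with four parallel branches concatenated along the channel axis. The class name and the channel widths (192 in, 64/96/128/16/32/32) are illustrative assumptions, and the last lines work through the 1×1 reduction example from the list above (weights only, biases ignored).

```python
import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    """Four parallel branches (1x1, 3x3, 5x5, pooling) concatenated on channels."""

    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(),          # 1x1 reduction
            nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU(),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(),          # 1x1 reduction
            nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU(),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),                # keeps H and W
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)


inception = InceptionModule(192, 64, 96, 128, 16, 32, 32)
inception_out = inception(torch.zeros(1, 192, 28, 28))
print(tuple(inception_out.shape))  # channels = 64 + 128 + 32 + 32

# 1x1 reduction example: 256 -> 256 channels with a 3x3 conv
direct_3x3_weights = 3 * 3 * 256 * 256
reduced_weights = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 256
print(direct_3x3_weights, reduced_weights)
```

Every branch preserves height and width through padding, which is what makes the channel-wise concatenation possible.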

**Parameters:** Only 6M (23× fewer than VGG-16!)

**Why it matters:**
- Efficiency: Fewer parameters, faster inference
- Multi-scale: Better feature extraction (objects at different scales)
- Influenced later architectures (ResNeXt, EfficientNet)

---

### **5. ResNet (2015) - The Game Changer**

**Authors:** Kaiming He et al. (Microsoft Research)  
**Dataset:** ImageNet  
**Accuracy:** 96.4% top-5 (3.57% top-5 error, achieved by an ensemble of ResNets including ResNet-152)

**Problem:** Deep networks are hard to train (vanishing gradients, degradation problem).

**Solution:** **Residual connections (skip connections)** - allow gradients to flow directly through network.

**Residual Block:**
```
Input x
    ↓
    ├─────────────────────────────┐ (identity shortcut)
    ↓                             ↓
Conv(3×3) → ReLU → Conv(3×3)      ↓
    ↓                             ↓
    └─────────── Add ─────────────┘
                  ↓
                ReLU
                  ↓
               Output
```

**Mathematical formulation:**
$$
\text{Output} = F(x) + x
$$

Where $F(x)$ is the residual mapping (what the layers learn).

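**Why this works:**
- If the identity mapping is optimal, the layers only need to learn $F(x) = 0$, which is easier than fitting an identity through stacked nonlinear layers
- Gradient flows directly through the skip connection: $\frac{\partial y}{\partial x} = \frac{\partial F}{\partial x} + 1$
- The "+1" term keeps a gradient signal flowing even when $\frac{\partial F}{\partial x}$ is small, which mitigates (rather than strictly eliminates) vanishing gradients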
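
Both points can be checked numerically. The sketch below is a minimal residual block matching the diagram above (batch normalization is omitted for brevity, and the class name is illustrative). Zeroing the last convolution forces $F(x) = 0$, so the block should reduce to `ReLU(x)` while the gradient still reaches the input through the shortcut.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x) with F = Conv3x3 -> ReLU -> Conv3x3 (identity shortcut)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))  # F(x)
        return F.relu(residual + x)                   # add the shortcut, then ReLU


torch.manual_seed(0)
block = BasicResidualBlock(channels=8)

# Force F(x) = 0 by zeroing the last convolution.
nn.init.zeros_(block.conv2.weight)
nn.init.zeros_(block.conv2.bias)

x = torch.randn(2, 8, 16, 16, requires_grad=True)
y = block(x)
identity_recovered = torch.allclose(y, F.relu(x))

# The residual branch contributes no gradient here, yet x still receives one.
y.sum().backward()
grad_reaches_input = bool(x.grad.abs().sum() > 0)

print(f"Block reduces to ReLU(x): {identity_recovered}")
print(f"Gradient reaches the input: {grad_reaches_input}")
```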

**Architecture (ResNet-50):**
```
Input(224×224×3)
→ Conv(64, 7×7, stride=2) + BN + ReLU + MaxPool(3×3, stride=2)
→ [Bottleneck Block × 3]  (64 channels, 256 out)
→ [Bottleneck Block × 4]  (128 channels, 512 out)
→ [Bottleneck Block × 6]  (256 channels, 1024 out)
→ [Bottleneck Block × 3]  (512 channels, 2048 out)
→ GlobalAvgPool → Dense(1000) + softmax
```

**Variants:**
- ResNet-18, ResNet-34: Basic residual blocks (2 conv layers per block)
- ResNet-50, ResNet-101, ResNet-152: Bottleneck blocks (1×1 → 3×3 → 1×1, more efficient)

**Parameters:** 25M (ResNet-50), 60M (ResNet-152)

**Impact:**
- ✅ **Train very deep networks:** 152 layers (vs 19 in VGG)
- ✅ **Better accuracy:** 96.4% top-5 with an ensemble (below the roughly 5% top-5 error commonly cited for humans on ImageNet)
- ✅ **Transfer learning:** ResNet features transfer well to other tasks
- ✅ **Foundation for modern architectures:** Most SOTA models use residual connections

---

### **Architecture Comparison Table**

| Architecture | Year | Layers | Parameters | ImageNet Top-5 | Key Innovation |
|--------------|------|--------|------------|----------------|----------------|
| **LeNet-5** | 1998 | 7 | 60K | N/A (MNIST) | First successful CNN |
| **AlexNet** | 2012 | 8 | 60M | 84.7% | ReLU, dropout, GPU training |
| **VGG-16** | 2014 | 16 | 138M | 92.7% | Uniform 3×3 filters |
| **Inception v1** | 2014 | 22 | 6M | 93.3% | Multi-scale features, 1×1 conv |
| **ResNet-50** | 2015 | 50 | 25M | ~93% (single model) | Residual connections (skip) |
| **ResNet-152** | 2015 | 152 | 60M | ~95% (single model), 96.4% (ensemble) | Extremely deep networks |

---

### **Modern Architectures (2017-2020)**

#### **DenseNet (2017)** - Connect Everything
- Within each dense block, every layer receives the feature maps of all preceding layers (dense connections)
- Feature reuse, fewer parameters, better gradient flow
- Parameters: 8M (DenseNet-121)

#### **MobileNet (2017)** - Mobile Efficiency
- Depthwise separable convolutions (roughly 8-9× fewer weights and multiply-adds than a standard 3×3 convolution)
- Designed for mobile devices (latency <100ms)
- Parameters: 4M
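
The 8-9× figure follows from counting weights. A depthwise separable convolution replaces one k×k convolution over all channels with a per-channel k×k convolution followed by a 1×1 pointwise convolution. The sketch below does the arithmetic in plain Python for an illustrative 256-channel layer (helper names are hypothetical, biases ignored).

```python
def standard_conv_weights(k: int, c_in: int, c_out: int) -> int:
    """One k x k convolution mixing all input channels into all output channels."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k: int, c_in: int, c_out: int) -> int:
    """Per-channel k x k (depthwise) conv followed by a 1 x 1 (pointwise) conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 256, 256  # illustrative layer shape
standard = standard_conv_weights(k, c_in, c_out)
separable = depthwise_separable_weights(k, c_in, c_out)
reduction = standard / separable

print(f"Standard 3x3 conv:        {standard:,} weights")
print(f"Depthwise separable conv: {separable:,} weights")
print(f"Reduction factor:         {reduction:.2f}x")
```

For stride-1 layers with the same output size, multiply-adds scale with the weight count, so the compute saving is the same factor. The factor approaches k² = 9 as the number of output channels grows.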

#### **EfficientNet (2019)** - Compound Scaling
- Neural Architecture Search (NAS) to find optimal architecture
- Compound scaling: balance depth, width, and resolution
- SOTA accuracy with fewer parameters
- EfficientNet-B7: 84.4% top-1 accuracy (vs roughly 78% for ResNet-152)

#### **Vision Transformer (2020)** - Attention Without Convolutions
- Replaces convolutions with self-attention (inspired by NLP transformers)
- Treats image as sequence of patches (16×16 pixels)
- SOTA on ImageNet: 88.5% top-1 (ViT-Huge)
- Requires massive pre-training datasets to reach this accuracy (100M+ images)
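
The "image as a sequence of patches" step is small in code. One common way to implement it, shown here as a sketch rather than the reference implementation, is a convolution whose kernel size and stride both equal the patch size, so each output position is a linear projection of one non-overlapping patch. The embedding width of 768 is an illustrative assumption.

```python
import torch
import torch.nn as nn

patch_size = 16
embed_dim = 768  # illustrative embedding width

# kernel_size == stride, so every output location sees exactly one 16x16 patch
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

images = torch.zeros(2, 3, 224, 224)       # dummy batch of two RGB images
feature_grid = patch_embed(images)          # (2, 768, 14, 14)
tokens = feature_grid.flatten(2).transpose(1, 2)  # (2, 196, 768): one token per patch

num_patches = (224 // patch_size) ** 2
print(tuple(tokens.shape), num_patches)
```

The resulting token sequence (plus position embeddings and a class token, not shown) is what the transformer encoder consumes.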

---

### **When to Use Which Architecture?**

| Use Case | Recommended Architecture | Reason |
|----------|-------------------------|---------|
| **Research/Experimentation** | ResNet-50, EfficientNet | Good balance accuracy/speed |
| **Production (accuracy critical)** | ResNet-152, EfficientNet-B7 | Best accuracy |
| **Production (speed critical)** | MobileNet, EfficientNet-B0 | Fast inference |
| **Transfer learning** | ResNet-50, VGG-16 | Well-studied, many pre-trained models |
| **Mobile/Edge devices** | MobileNet, EfficientNet-B0-B2 | Designed for low latency |
| **Semiconductor (wafer maps)** | ResNet-34/50, EfficientNet-B1 | Good for small datasets with transfer learning |
| **Medical imaging** | DenseNet, ResNet | Feature reuse, high accuracy |
| **Real-time video** | MobileNet, YOLO | Low latency (<50ms) |

---

### **Implementing Classic Architectures**

**PyTorch:** Available in `torchvision.models`
```python
import torch.nn as nn
import torchvision.models as models

# Pre-trained on ImageNet.
# torchvision >= 0.13 selects weights via `weights=`; older versions use pretrained=True.
resnet50 = models.resnet50(weights="IMAGENET1K_V1")
vgg16 = models.vgg16(weights="IMAGENET1K_V1")
inception_v3 = models.inception_v3(weights="IMAGENET1K_V1")
efficientnet_b0 = models.efficientnet_b0(weights="IMAGENET1K_V1")

# Custom number of classes (transfer learning): replace the final layer
resnet50.fc = nn.Linear(resnet50.fc.in_features, 20)  # 20 defect classes
```

**TensorFlow/Keras:** Available in `tf.keras.applications`
```python
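from tensorflow.keras.applications import ResNet50, VGG16, InceptionV3, EfficientNetB0
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model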

# Pre-trained on ImageNet
resnet50 = ResNet50(weights='imagenet', include_top=False)
vgg16 = VGG16(weights='imagenet', include_top=False)
efficientnet = EfficientNetB0(weights='imagenet', include_top=False)

# Add custom classifier
x = resnet50.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
output = Dense(20, activation='softmax')(x)  # 20 defect classes
model = Model(inputs=resnet50.input, outputs=output)
```

---

**Next:** Transfer learning - fine-tune ResNet-50 on wafer maps! 🚀

### 📝 Implementation

**Purpose:** Load a pre-trained ResNet-50 and adapt it to the 10-class wafer map task.

**Key implementation details below.**

In [None]:
"""
Transfer Learning: Fine-Tune ResNet-50 on Wafer Maps
Transfer learning leverages pre-trained models (ImageNet) for new tasks with limited data.
Typical benefits (rough, task-dependent figures):
    - Less training data needed (often ~1K samples instead of 10K+ from scratch)
    - Faster convergence (a handful of epochs instead of dozens)
    - Often better accuracy than training from scratch on small datasets
    - Lower computational cost
Approach:
    1. Load pre-trained ResNet-50 (trained on ImageNet 1.2M images)
    2. Replace final classification layer (1000 classes → 10 defect classes)
    3. Freeze the pre-trained backbone (keep learned low-level features like edges)
    4. Train only the new classification layer on wafer map data
"""
import torch
import torch.nn as nn
import torchvision.models as models
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt
import time
print("="*80)
print("TRANSFER LEARNING: ResNet-50 on Wafer Maps")
print("="*80)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nDevice: {device}")
#------------------------------------------------------------------------------
# 1. Load Pre-Trained ResNet-50
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("1. LOADING PRE-TRAINED ResNet-50")
print("="*80)
# Load ResNet-50 pre-trained on ImageNet
# (torchvision >= 0.13 uses `weights=`; older versions use pretrained=True)
resnet50 = models.resnet50(weights="IMAGENET1K_V1")
print(f"\nOriginal ResNet-50:")
print(f"  Input: 224×224×3 (RGB images)")
print(f"  Output: 1000 classes (ImageNet)")
print(f"  Parameters: {sum(p.numel() for p in resnet50.parameters()):,}")
# Modify for grayscale input (1 channel instead of 3)
# Option 1: Replicate grayscale to 3 channels (easier, we'll use this)
# Option 2: Modify first conv layer to accept 1 channel
# Modify final layer for 10 defect classes
num_features = resnet50.fc.in_features  # 2048 for ResNet-50
resnet50.fc = nn.Linear(num_features, 10)  # 10 defect classes
resnet50 = resnet50.to(device)
print(f"\nModified ResNet-50:")
print(f"  Input: 224×224×3 (will replicate grayscale)")
print(f"  Output: 10 classes (defect types)")
print(f"  New FC layer: {num_features} → 10")
#------------------------------------------------------------------------------
# 2. Freeze Early Layers (Feature Extraction)
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("2. FREEZING EARLY LAYERS")
print("="*80)


# Freeze all layers except final FC layer
for name, param in resnet50.named_parameters():
    if "fc" not in name:  # Freeze all except FC layer
        param.requires_grad = False
trainable_params = sum(p.numel() for p in resnet50.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in resnet50.parameters())
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params*100:.1f}%)")
print(f"Frozen parameters: {total_params - trainable_params:,} ({(1-trainable_params/total_params)*100:.1f}%)")
print("\nStrategy: Feature extraction")
print("  - Early layers: FROZEN (keep ImageNet features)")
print("  - Final FC layer: TRAINABLE (learn defect-specific features)")
#------------------------------------------------------------------------------
# 3. Prepare Data for ResNet (224×224×3)
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("3. DATA PREPROCESSING FOR ResNet")
print("="*80)
# Resize wafer maps from 128×128 to 224×224 (ResNet input size)
# Replicate grayscale to 3 channels (RGB)
import torch.nn.functional as F_resize
# Resize and replicate channels
X_train_resnet = F_resize.interpolate(X_train_tensor, size=(224, 224), mode='bilinear', align_corners=False)
X_train_resnet = X_train_resnet.repeat(1, 3, 1, 1)  # 1 channel → 3 channels
X_val_resnet = F_resize.interpolate(X_val_tensor, size=(224, 224), mode='bilinear', align_corners=False)
X_val_resnet = X_val_resnet.repeat(1, 3, 1, 1)
X_test_resnet = F_resize.interpolate(X_test_tensor, size=(224, 224), mode='bilinear', align_corners=False)
X_test_resnet = X_test_resnet.repeat(1, 3, 1, 1)
print(f"Original: {X_train_tensor.shape} (N, C=1, H=128, W=128)")
print(f"ResNet input: {X_train_resnet.shape} (N, C=3, H=224, W=224)")
# Create DataLoaders
train_dataset_resnet = TensorDataset(X_train_resnet, y_train_tensor)
train_loader_resnet = DataLoader(train_dataset_resnet, batch_size=16, shuffle=True)
val_dataset_resnet = TensorDataset(X_val_resnet, y_val_tensor)
val_loader_resnet = DataLoader(val_dataset_resnet, batch_size=16, shuffle=False)
#------------------------------------------------------------------------------
# 4. Training (Fine-Tuning)
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("4. FINE-TUNING ResNet-50")
print("="*80)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, resnet50.parameters()), lr=0.001)
num_epochs = 10  # Fewer epochs needed with transfer learning
train_losses_tl = []
val_losses_tl = []
train_accs_tl = []
val_accs_tl = []
print(f"\nTraining for {num_epochs} epochs (transfer learning)...")
start_time = time.time()


for epoch in range(num_epochs):
    # Training (train() also lets BatchNorm layers in the frozen backbone update their running statistics)
    resnet50.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    
    for batch_X, batch_y in train_loader_resnet:
        outputs = resnet50(batch_X)
        loss = criterion(outputs, batch_y)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        train_loss += loss.item() * batch_X.size(0)
        _, predicted = torch.max(outputs.data, 1)
        train_total += batch_y.size(0)
        train_correct += (predicted == batch_y).sum().item()
    
    train_loss /= len(train_loader_resnet.dataset)
    train_acc = train_correct / train_total
    train_losses_tl.append(train_loss)
    train_accs_tl.append(train_acc)
    
    # Validation
    resnet50.eval()
    val_loss = 0.0
    val_correct = 0
    val_total = 0
    
    with torch.no_grad():
        for batch_X, batch_y in val_loader_resnet:
            outputs = resnet50(batch_X)
            loss = criterion(outputs, batch_y)
            
            val_loss += loss.item() * batch_X.size(0)
            _, predicted = torch.max(outputs.data, 1)
            val_total += batch_y.size(0)
            val_correct += (predicted == batch_y).sum().item()
    
    val_loss /= len(val_loader_resnet.dataset)
    val_acc = val_correct / val_total
    val_losses_tl.append(val_loss)
    val_accs_tl.append(val_acc)
    
    if (epoch + 1) % 2 == 0 or epoch == 0:
        print(f"Epoch [{epoch+1:2d}/{num_epochs}] "
              f"Train Loss: {train_loss:.4f} | Train Acc: {train_acc:.4f} | "
              f"Val Loss: {val_loss:.4f} | Val Acc: {val_acc:.4f}")
training_time_tl = time.time() - start_time
print(f"\nTransfer learning training completed in {training_time_tl:.2f} seconds")
print(f"Final validation accuracy: {val_accs_tl[-1]:.4f}")
#------------------------------------------------------------------------------
# 5. Comparison: Custom CNN vs Transfer Learning
#------------------------------------------------------------------------------
print("\n" + "="*80)
print("5. COMPARISON: Custom CNN vs Transfer Learning")
print("="*80)
# Evaluate both models on test set
# Custom CNN (from previous cell)
model.eval()
test_correct_custom = 0
test_total_custom = 0
with torch.no_grad():


    for batch_X, batch_y in test_loader:
        outputs = model(batch_X)
        _, predicted = torch.max(outputs.data, 1)
        test_total_custom += batch_y.size(0)
        test_correct_custom += (predicted == batch_y).sum().item()
test_acc_custom = test_correct_custom / test_total_custom
# ResNet-50 transfer learning
test_dataset_resnet = TensorDataset(X_test_resnet, y_test_tensor)
test_loader_resnet = DataLoader(test_dataset_resnet, batch_size=16, shuffle=False)
resnet50.eval()
test_correct_tl = 0
test_total_tl = 0
with torch.no_grad():
    for batch_X, batch_y in test_loader_resnet:
        outputs = resnet50(batch_X)
        _, predicted = torch.max(outputs.data, 1)
        test_total_tl += batch_y.size(0)
        test_correct_tl += (predicted == batch_y).sum().item()
test_acc_tl = test_correct_tl / test_total_tl
print(f"\nTest Set Accuracy:")
print(f"  Custom CNN (from scratch):      {test_acc_custom:.4f}")
print(f"  ResNet-50 (transfer learning):  {test_acc_tl:.4f}")
print(f"  Difference (ResNet-50 minus custom CNN): {(test_acc_tl - test_acc_custom)*100:+.2f} percentage points")
print(f"\nTraining Time:")
print(f"  Custom CNN:   {training_time:.2f} seconds (20 epochs)")
print(f"  ResNet-50:    {training_time_tl:.2f} seconds (10 epochs)")
print(f"\nParameters:")
print(f"  Custom CNN:   {sum(p.numel() for p in model.parameters()):,}")
print(f"  ResNet-50:    {sum(p.numel() for p in resnet50.parameters()):,} (only {trainable_params:,} trained)")
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Accuracy comparison
axes[0].plot(range(1, len(train_accs)+1), train_accs, label='Custom CNN Train', linestyle='--')
axes[0].plot(range(1, len(val_accs)+1), val_accs, label='Custom CNN Val', linestyle='--')
axes[0].plot(range(1, len(train_accs_tl)+1), train_accs_tl, label='ResNet-50 Train', linewidth=2)
axes[0].plot(range(1, len(val_accs_tl)+1), val_accs_tl, label='ResNet-50 Val', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Training Comparison: Custom CNN vs Transfer Learning')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Bar chart: Test accuracy comparison
models_names = ['Custom CNN\n(from scratch)', 'ResNet-50\n(transfer learning)']
test_accs = [test_acc_custom, test_acc_tl]
colors = ['#3498db', '#2ecc71']
axes[1].bar(models_names, test_accs, color=colors, alpha=0.8)
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Test Set Performance')
axes[1].set_ylim([0.8, 1.0])
axes[1].grid(True, alpha=0.3, axis='y')
# Add accuracy values on bars
for i, (name, acc) in enumerate(zip(models_names, test_accs)):
    axes[1].text(i, acc + 0.01, f'{acc:.4f}', ha='center', fontsize=12, fontweight='bold')
plt.tight_layout()
plt.savefig('transfer_learning_comparison.png', dpi=150, bbox_inches='tight')
print("\nSaved: transfer_learning_comparison.png")


plt.show()
print("\n" + "="*80)
print("Transfer Learning Complete!")
print("="*80)
print("\nKey Takeaways:")
print("  ✅ Transfer learning can reach high accuracy with less training (compare the numbers above)")
print("  ✅ Leverages ImageNet features (edges, textures) for wafer maps")
print("  ✅ Used 10 epochs here vs 20 for the custom CNN trained from scratch")
print("  ✅ Production recommendation: Use transfer learning for semiconductor applications")
print("="*80)


## 🎯 Real-World Projects & Key Takeaways

### **Semiconductor Post-Silicon Validation Projects**

---

#### **Project 1: Production Wafer Map Defect Classifier**

**Objective:** Deploy real-time defect pattern classification system for 300mm wafer fabs.

**Business Value:** $5M-$20M per incident through automated root cause analysis (reduce time from days to hours).

**Dataset:**
- 50K+ real wafer maps from production STDF files
- 300×300 pixel resolution (10K-50K die per wafer)
- 25 defect classes (ring, scratch, cluster, edge, radial, random, + subclasses)
- Highly imbalanced (80% normal, 20% defects, rare classes <1%)

**Architecture:**
```
EfficientNet-B3 (transfer learning from ImageNet)
  → Replace final layer (1000 → 25 classes)
  → Fine-tune with class-weighted loss (handle imbalance)
  → Data augmentation (rotation, flip, zoom - wafers can be oriented any way)
```

**Implementation Steps:**
1. **Data pipeline:** Extract wafer maps from STDF → 300×300 images → augmentation
2. **Transfer learning:** Start with EfficientNet-B3 (pre-trained ImageNet)
3. **Handle imbalance:** Focal loss + class weights + SMOTE oversampling
4. **Ensemble:** Train 5 models with different seeds, average predictions (boost accuracy 2-3%)
5. **Explainability:** Grad-CAM to visualize which regions triggered classification
6. **Deployment:** ONNX Runtime with GPU, <50ms inference per wafer

**Success Metrics:**
- Top-1 accuracy ≥97% (25-class classification)
- Per-class recall ≥90% (catch all defect types)
- Inference latency <50ms per wafer (real-time fab integration)
- Explainability: Grad-CAM highlights defect regions (engineer validation)

**Challenges & Solutions:**
- **Imbalance:** Rare defects <1% → Focal loss (down-weight easy examples), SMOTE oversampling
- **Rotation invariance:** Wafers oriented randomly → Data augmentation (rotate 0-360°)
- **False positives:** Cost of false alarm high → Threshold tuning (precision >95%)
- **Production drift:** Defect patterns change over time → Monthly retraining pipeline
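
Focal loss, mentioned above for the rare defect classes, scales each sample's cross-entropy by $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true class, so well-classified examples contribute little. The following is a minimal multi-class sketch, not a tuned production loss: the function name, `gamma=2.0`, and the optional per-class weight vector are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def focal_loss(logits, targets, gamma=2.0, class_weights=None):
    """Multi-class focal loss: mean of -(1 - p_t)^gamma * log(p_t)."""
    log_probs = F.log_softmax(logits, dim=1)
    log_pt = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p of true class
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt
    if class_weights is not None:
        loss = loss * class_weights[targets]  # optional per-class re-weighting
    return loss.mean()


torch.manual_seed(0)
logits = torch.randn(6, 25)                     # 6 samples, 25 defect classes
targets = torch.tensor([0, 3, 3, 24, 7, 1])

loss_focal = focal_loss(logits, targets, gamma=2.0)
loss_gamma0 = focal_loss(logits, targets, gamma=0.0)   # reduces to cross-entropy
loss_ce = F.cross_entropy(logits, targets)

print(f"Focal (gamma=2): {loss_focal.item():.4f}")
print(f"Focal (gamma=0): {loss_gamma0.item():.4f}")
print(f"Cross-entropy:   {loss_ce.item():.4f}")
```

With `gamma=0` the modulating factor is 1 and the loss matches standard cross-entropy, which is a convenient sanity check before tuning `gamma`.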

---

#### **Project 2: Die-Level SEM Image Defect Detection**

**Objective:** Classify microscopic defects in die-level SEM (Scanning Electron Microscope) images.

**Business Value:** $1M-$5M/year through automated defect review (reduce manual inspection time 80%).

**Dataset:**
- 100K+ SEM images (512×512 pixels, grayscale)
- 50+ defect types (opens, shorts, voids, scratches, particles, contamination)
- Multi-label (single image can have multiple defects)

**Architecture:**
```
ResNet-50 backbone (transfer learning)
  → Multi-label classification head (sigmoid for each class)
  → Binary cross-entropy loss (multi-label)
```

**Key Techniques:**
- **Multi-label classification:** Single image → multiple defect labels (not mutually exclusive)
- **High-resolution inputs:** 512×512 images (preserve fine details)
- **Test-time augmentation:** Average predictions across rotations/flips (improve accuracy)
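
The multi-label head and test-time augmentation can be sketched in a few lines. Everything below is a toy setup under stated assumptions: a tiny stand-in backbone replaces ResNet-50 so the example runs quickly, inputs are 64×64 instead of 512×512, and the names `sem_model` and `predict_with_tta` are hypothetical.

```python
import torch
import torch.nn as nn

num_defect_types = 50

# Stand-in backbone (the project would use a pre-trained ResNet-50 here).
backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
head = nn.Linear(16, num_defect_types)   # one logit per defect type
sem_model = nn.Sequential(backbone, head)

# BCEWithLogitsLoss applies a sigmoid per class, so labels are independent.
criterion = nn.BCEWithLogitsLoss()

images = torch.rand(4, 1, 64, 64)
labels = torch.zeros(4, num_defect_types)
labels[0, 2] = 1.0                        # first image has two defects
labels[0, 7] = 1.0
loss = criterion(sem_model(images), labels)


@torch.no_grad()
def predict_with_tta(model, batch):
    """Average per-class probabilities over flipped and rotated views."""
    model.eval()
    views = [
        batch,
        torch.flip(batch, dims=[3]),           # horizontal flip
        torch.flip(batch, dims=[2]),           # vertical flip
        torch.rot90(batch, k=1, dims=(2, 3)),  # 90 degree rotation
    ]
    probs = torch.stack([torch.sigmoid(model(view)) for view in views])
    return probs.mean(dim=0)


tta_probs = predict_with_tta(sem_model, images)
print(tuple(tta_probs.shape), float(loss))
```

Each output is an independent probability, so a per-class threshold (not an argmax) decides which defects are reported.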

**Success Metrics:**
- Mean Average Precision (mAP) ≥0.90 (multi-label metric)
- Per-defect recall ≥85% (catch all defect types)
- Inference <100ms per image (batch processing 100+ images/second)

---

#### **Project 3: Wafer Yield Prediction with Spatial Features (CNN + MLP)**

**Objective:** Predict die-level yield combining parametric test data with spatial location.

**Business Value:** $50M-$200M/year scrap reduction (notebook 052 problem, but with CNN spatial features).

**Dataset:**
- 500K+ die samples
- 50 parametric features (Vdd, Idd, frequency, power, temperature)
- Spatial coordinates (wafer_id, die_x, die_y)
- Binary target (pass/fail)

**Architecture (Hybrid CNN + MLP):**
```
Spatial Path (CNN):
  wafer_map (300×300) → CNN → Spatial features (128-dim)

Parametric Path (MLP):
  parametric_tests (50-dim) → MLP → Test features (128-dim)

Fusion:
  Concat(spatial_features, test_features) → Dense(256) → Output(1, Sigmoid)
```
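
As a quick sanity check on the fusion stage, the parameter count of the two dense layers can be worked out by hand. The helper `dense_params` is an illustrative assumption; the layer sizes are the ones given in the diagram.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

spatial_dim, test_dim = 128, 128        # branch output sizes from the diagram
fused_dim = spatial_dim + test_dim      # concatenation gives a 256-dim vector
hidden = dense_params(fused_dim, 256)   # Dense(256): 65,792 parameters
output = dense_params(256, 1)           # Output(1, Sigmoid): 257 parameters
fusion_head_params = hidden + output    # 66,049 parameters
```

The fusion head is small compared with a typical CNN branch, so most of the model's capacity (and training cost) sits in the spatial path.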

**Why hybrid:**
- CNN captures spatial correlation (neighboring die fail together)
- MLP captures parametric test patterns
- Combined: Better than either alone (+5% accuracy improvement)

**Success Metrics:**
- AUC-ROC ≥0.98 (binary classification)
- Precision ≥95% (minimize false positives - shipping bad dies)
- Recall ≥90% (minimize false negatives - scrapping good dies)
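
One way to trade these two metrics off is to pick the lowest decision threshold that still meets the precision target, since lowering the threshold can only keep or raise recall. The sketch below assumes label `1` means "pass" (so a false positive is a bad die predicted as good); the scores and labels are made-up illustration data, and the function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def lowest_threshold_for_precision(scores, labels, min_precision):
    """Scan thresholds in ascending order; the first hit has the best recall."""
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision:
            return t, p, r
    return None  # target precision not reachable on this data

scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.10]  # hypothetical model outputs
labels = [1, 1, 1, 0, 1, 1, 0, 0]                          # 1 = pass, 0 = fail
choice = lowest_threshold_for_precision(scores, labels, min_precision=0.95)
# choice == (0.85, 1.0, 0.6): threshold, precision, recall
```

On this toy data the 95% precision target costs 40% of recall, which illustrates why both targets need to be checked on a held-out set rather than assumed to hold together.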

---

#### **Project 4: Adaptive Binning with Reinforcement Learning + CNN**

**Objective:** Dynamically optimize device binning (speed grades) using RL with CNN spatial features.

**Business Value:** $20M-$50M/year through optimized binning (maximize high-bin yield).

**Problem:** Devices tested at multiple corners (VDD/frequency combinations), assign to bins (e.g., 3.0GHz, 3.2GHz, 3.5GHz) to maximize revenue.

**Architecture (RL with CNN features):**
```
State: Current test results + wafer map CNN features
Action: Which corner to test next, when to stop testing, which bin to assign
Reward: Revenue from bin assignment - test cost
Policy Network: DQN with CNN + MLP encoder
```

**Key Techniques:**
- CNN extracts spatial patterns from wafer map (predict untested corners)
- RL learns optimal test sequence (minimize tests, maximize bin revenue)
- Offline RL training on historical data (safe, no fab impact)
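
The reward line in the diagram (bin revenue minus test cost) could look like the sketch below. All prices, the per-corner cost, and the rule that an out-of-spec assignment earns no revenue are hypothetical assumptions chosen for illustration, not figures from any real product.

```python
# Hypothetical selling price per speed bin (arbitrary currency units)
BIN_PRICE = {"3.5GHz": 300.0, "3.2GHz": 220.0, "3.0GHz": 150.0, "reject": 0.0}
COST_PER_CORNER = 0.05  # hypothetical tester cost for one VDD/frequency corner

def episode_reward(assigned_bin, corners_tested, meets_spec):
    """Reward for one device: revenue from the assigned bin minus test cost.

    Assumption: a device binned above its true capability (meets_spec=False)
    earns nothing, but the test cost is still paid.
    """
    revenue = BIN_PRICE[assigned_bin] if meets_spec else 0.0
    return revenue - COST_PER_CORNER * corners_tested

thorough = episode_reward("3.2GHz", corners_tested=8, meets_spec=True)
frugal = episode_reward("3.2GHz", corners_tested=4, meets_spec=True)
overbinned = episode_reward("3.5GHz", corners_tested=2, meets_spec=False)
```

The agent is rewarded for stopping early only when the bin assignment stays correct, which is the tension the policy has to learn to balance.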

**Success Metrics:**
- Revenue increase of 3-5% (better binning accuracy)
- Test time reduction ≥20% (fewer corners tested)
- Bin accuracy ≥99% (devices meet spec)

---

### **General AI/ML Projects**

---

#### **Project 5: Medical Image Classification (Chest X-Ray Diagnosis)**

**Objective:** Classify chest X-rays into normal/pneumonia/COVID-19/tuberculosis.

**Dataset:** 100K+ chest X-rays (256×256 grayscale), 4 classes.

**Architecture:** DenseNet-121 (transfer learning from ImageNet), class-weighted loss.

**Business Value:** $1M-$10M/year through faster diagnosis, reduce radiologist workload 50%.

**Success Metrics:** AUC-ROC ≥0.95, sensitivity ≥95% (minimize false negatives - critical for patient safety).

---

#### **Project 6: Object Detection for Autonomous Vehicles**

**Objective:** Real-time detection of cars, pedestrians, cyclists, traffic signs.

**Dataset:** 500K+ images with bounding boxes, 20 classes.

**Architecture:** YOLO v8 or EfficientDet (balance accuracy and speed).

**Business Value:** Enable Level 4 autonomous driving (market value $trillions).

**Success Metrics:** mAP ≥0.60, inference <50ms (20 FPS), detect objects 50m+ away.

---

#### **Project 7: Facial Recognition Access Control System**

**Objective:** Secure building access using face recognition (1:N matching).

**Dataset:** 10K+ employees, 100+ photos per person, various lighting/angles.

**Architecture:** FaceNet (triplet loss) or ArcFace (angular margin loss), generate 128-dim embeddings.

**Business Value:** $500K-$2M/year through automated access control, security audit trails.

**Success Metrics:** True Accept Rate ≥99% @ False Accept Rate <0.01%.
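
The "TAR @ FAR" metric can be computed from similarity scores as in this sketch. It assumes higher scores mean "more likely the same person" and that a match is accepted when the score reaches the threshold; the function name and the score lists are made up for illustration.

```python
def tar_at_far(genuine, impostor, max_far):
    """True accept rate at the loosest threshold whose false accept rate <= max_far.

    genuine  : similarity scores for same-person pairs
    impostor : similarity scores for different-person pairs
    """
    candidates = sorted(set(genuine) | set(impostor))
    candidates.append(float("inf"))  # rejecting everything always satisfies the FAR limit
    for t in candidates:  # FAR falls as t rises, so the first hit maximizes TAR
        far = sum(s >= t for s in impostor) / len(impostor)
        if far <= max_far:
            tar = sum(s >= t for s in genuine) / len(genuine)
            return t, tar, far

genuine = [0.91, 0.88, 0.85, 0.80, 0.62]                                # hypothetical
impostor = [0.70, 0.55, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]  # hypothetical
loose = tar_at_far(genuine, impostor, max_far=0.1)    # (0.62, 1.0, 0.1)
strict = tar_at_far(genuine, impostor, max_far=0.0)   # (0.8, 0.8, 0.0)
```

Verifying a FAR as low as 0.01% needs at least tens of thousands of impostor pairs, because the smallest nonzero FAR that can be measured is one over the number of impostor comparisons.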

---

#### **Project 8: Content Moderation (Inappropriate Image Detection)**

**Objective:** Flag inappropriate content (NSFW, violence, hate symbols) on social media.

**Dataset:** 1M+ images, multi-label (single image can have multiple violations).

**Architecture:** EfficientNet-B4 (multi-label classification), weighted loss.

**Business Value:** $10M-$50M/year through automated moderation (reduce manual review 70%).

**Success Metrics:** Precision ≥95% (minimize false positives), recall ≥85% (catch most violations).

---

## 🔑 Key Takeaways

### **CNN Architecture Design Principles**

1. **Start simple, go deep gradually**
   - Begin with 3-5 conv layers (stacked 3×3 convolutions, VGG-style)
   - Add residual connections if >10 layers (like ResNet)
   - Use batch normalization after every conv layer

2. **Receptive field matters**
   - Stack small filters (3×3) instead of large filters (7×7)
   - Two 3×3 conv = one 5×5 receptive field, but fewer parameters
   - Use stride=2 or pooling to downsample (increase receptive field faster)

3. **Parameter efficiency**
   - 1×1 convolutions reduce channels (dimensionality reduction)
   - Depthwise separable convolutions reduce parameters 8-9× for 3×3 kernels (MobileNet)
   - Global Average Pooling replaces FC layers (fewer parameters, regularization)

4. **Modern best practices**
   - ReLU activation (or variants: LeakyReLU, ELU, Swish)
   - Batch normalization (stabilizes training, allows higher learning rates)
   - Dropout in fully connected layers (prevents overfitting)
   - Data augmentation (rotation, flip, crop, color jittering)
   - Skip connections for deep networks (ResNet-style)
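
The receptive-field and parameter claims in points 2 and 3 can be checked with a few lines of arithmetic. The helper names and the channel counts (64 in, 64 or 128 out) are assumptions picked for illustration; biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of conv/pool layers (no dilation)."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1) input steps
        jump *= s              # striding makes later layers step further in the input
    return rf

C = 64
two_3x3 = 2 * conv_params(3, C, C)   # 73,728 weights, receptive field 5
one_5x5 = conv_params(5, C, C)       # 102,400 weights, receptive field 5

standard = conv_params(3, C, 128)    # 73,728 weights
depthwise = 3 * 3 * C                # one 3x3 filter per input channel: 576
pointwise = conv_params(1, C, 128)   # 1x1 conv mixes channels: 8,192
separable = depthwise + pointwise    # 8,768 weights
ratio = standard / separable         # about 8.4x fewer parameters
```

For a 3×3 kernel the ratio approaches 9 as the number of output channels grows, which is where the 8-9× figure comes from. Calling `receptive_field([3, 3], strides=[2, 1])` returns 7, showing how an early stride enlarges the field faster.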

---

### **Transfer Learning Strategy**

**When to use transfer learning:**
- ✅ Limited data (<10K samples)
- ✅ Similar domain (natural images, medical images, wafer maps all benefit from ImageNet features)
- ✅ Time/compute constraints (10× faster training)

**Best practices:**
1. **Start with feature extraction:** Freeze all layers except final classifier
2. **Fine-tune gradually:** After feature extraction converges, unfreeze last few layers
3. **Lower learning rate for fine-tuning:** Use 10-100× smaller LR for pre-trained layers
4. **Data augmentation:** More aggressive augmentation for small datasets
5. **Choose appropriate backbone:** ResNet-50 (balanced), EfficientNet (efficient), VGG (simple)
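
Steps 1-3 can be summarized as a two-phase plan. The sketch below is framework-agnostic: it only builds a list of `{"params", "lr"}` dictionaries, a shape that mirrors the per-parameter-group options PyTorch optimizers accept, but with layer names (plain strings) standing in for real parameters. The function name, the phase names, and the defaults are assumptions for illustration.

```python
def param_groups(layers, phase, head_lr=1e-3, backbone_lr_factor=0.1, n_unfrozen=2):
    """Build optimizer groups for a two-phase transfer-learning schedule.

    layers : ordered layer names; the last one is the new classifier head
    phase  : "feature_extraction" (head only) or "fine_tune" (head + last layers)
    Layers that appear in no group are treated as frozen.
    """
    head, backbone = layers[-1], layers[:-1]
    groups = [{"params": [head], "lr": head_lr}]
    if phase == "fine_tune":
        start = max(0, len(backbone) - n_unfrozen)
        # pre-trained layers get a smaller learning rate than the new head
        groups.append({"params": backbone[start:], "lr": head_lr * backbone_lr_factor})
    elif phase != "feature_extraction":
        raise ValueError(f"unknown phase: {phase!r}")
    return groups

# Names follow torchvision's ResNet layout, used here only as labels
LAYERS = ["conv1", "layer1", "layer2", "layer3", "layer4", "fc"]
phase1 = param_groups(LAYERS, "feature_extraction")  # train "fc" only
phase2 = param_groups(LAYERS, "fine_tune")           # add "layer3", "layer4" at 10x lower LR
```

Keeping the early layers out of every group reflects the usual reasoning that low-level edge and texture filters transfer well and need the least adjustment.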

**When NOT to use transfer learning:**
- ❌ Very different domain (e.g., satellite imagery, medical histology), where ImageNet features may transfer poorly
- ❌ Massive dataset (>1M samples, train from scratch can be better)
- ❌ Real-time constraints (large pre-trained backbones can be too slow; prefer a compact model such as MobileNet)

---

### **Production Deployment Checklist**

- ✅ **Model optimization:** ONNX export, INT8 quantization (about 4× smaller than FP32; speedup depends on hardware, up to ~4×)
- ✅ **Batch inference:** Process multiple images at once (up to ~10× throughput increase)
- ✅ **GPU acceleration:** Use ONNX Runtime with GPU provider (<10ms inference)
- ✅ **Monitoring:** Log predictions, latency, errors (detect model drift)
- ✅ **Explainability:** Grad-CAM or LIME to visualize decisions (engineer trust)
- ✅ **A/B testing:** Compare new model vs baseline before full rollout
- ✅ **Fallback:** Simple rule-based system if model fails
- ✅ **Retraining pipeline:** Automate monthly retraining as new data arrives

---

### **Semiconductor-Specific Best Practices**

1. **Data augmentation for wafer maps:**
   - Rotation (0-360°, wafers can be oriented any way)
   - Flip (horizontal, vertical)
   - Zoom (simulate different die sizes)
   - **Don't use:** Color jittering (wafer maps are binary pass/fail)

2. **Handle class imbalance:**
   - Most wafers are normal (80%+), defects rare (<20%)
   - Use focal loss, class weights, or SMOTE oversampling
   - Prioritize recall for rare critical defects

3. **Spatial correlation matters:**
   - CNN captures neighboring die failures (cluster patterns)
   - Combine with parametric test data (hybrid CNN + MLP)
   - Use attention mechanisms to focus on defect regions

4. **Explainability is critical:**
   - Engineers need to understand why the model classified a wafer the way it did
   - Use Grad-CAM to highlight defect regions
   - Build trust through interpretability
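
For the class-weight option in point 2, a common heuristic is inverse-frequency weighting, `n_samples / (n_classes * class_count)`. The sketch below assumes that heuristic; the function name, class names, and the 80/15/5 split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_per_class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Hypothetical label distribution: mostly normal wafers, few defects
labels = ["normal"] * 80 + ["edge_ring"] * 15 + ["scratch"] * 5
weights = balanced_class_weights(labels)
# normal is about 0.42, edge_ring about 2.22, scratch about 6.67
```

With these weights every class contributes the same total weight to the loss, so the rare defect classes are no longer drowned out by the normal wafers.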

---

### **What's Next?**

**After mastering CNNs:**
- 📘 **Notebook 054:** Recurrent Neural Networks (RNNs, LSTMs) for time-series
- 📘 **Notebook 055:** Transformers and Attention (modern SOTA architecture)
- 📘 **Notebook 056:** Object Detection (YOLO, Faster R-CNN, RetinaNet)
- 📘 **Notebook 057:** Semantic Segmentation (U-Net, DeepLab)
- 📘 **Notebook 058:** Generative Models (GANs, VAEs, Diffusion)

**Advanced CNN topics (not covered):**
- Neural Architecture Search (NAS): Automate architecture design
- Pruning and knowledge distillation: Compress models 10×
- 3D CNNs: Process video or volumetric medical data
- Graph CNNs: Process non-Euclidean data (molecules, social networks)

---

## ✅ Learning Objectives Review

By now, you should be able to:
- ✅ Understand convolution operations mathematically and implement from scratch
- ✅ Build CNNs in PyTorch and TensorFlow/Keras
- ✅ Explain the architecture evolution (LeNet → AlexNet → VGG → Inception → ResNet)
- ✅ Apply transfer learning to new domains (ImageNet → wafer maps)
- ✅ Optimize CNNs for production (quantization, ONNX, batching)
- ✅ Deploy CNN models with <100ms inference time
- ✅ Visualize what CNNs learn using Grad-CAM
- ✅ Choose appropriate architecture based on constraints (accuracy, speed, memory)

---

## 🎓 Congratulations!

You've mastered Convolutional Neural Networks! You can now:
- Build custom CNNs from scratch for any image task
- Leverage transfer learning for faster development
- Deploy production-grade models for semiconductor testing
- Optimize models for real-time inference constraints

**Next:** Dive into Recurrent Neural Networks for time-series and sequential data! 🚀

---

## 📚 Additional Resources

**Papers (Must Read):**
- LeNet-5: LeCun et al. (1998) - "Gradient-Based Learning Applied to Document Recognition"
- AlexNet: Krizhevsky et al. (2012) - "ImageNet Classification with Deep Convolutional Neural Networks"
- VGG: Simonyan & Zisserman (2014) - "Very Deep Convolutional Networks for Large-Scale Image Recognition"
- Inception: Szegedy et al. (2014) - "Going Deeper with Convolutions"
- ResNet: He et al. (2015) - "Deep Residual Learning for Image Recognition"

**Courses:**
- Stanford CS231n: Convolutional Neural Networks for Visual Recognition
- Fast.ai: Practical Deep Learning for Coders (transfer learning focus)

**Libraries:**
- `timm` (PyTorch Image Models): 700+ pre-trained models
- `tensorflow.keras.applications`: 30+ pre-trained models
- `segmentation_models.pytorch`: U-Net, FPN, DeepLab implementations

**Visualization:**
- Grad-CAM: Visualize what CNN learned
- TensorBoard: Monitor training in real-time
- Netron: Visualize model architecture

---

**Notebook Complete!** 🎉