# CNN Cheat Sheet

This cheat sheet covers the fundamentals of Convolutional Neural Networks (CNNs), key architectures and their evolution, and advanced training techniques. Each topic gets a brief explanation for quick reference. Use it to review the basics and spot areas for deeper study.

## Fundamentals of Convolutional Neural Networks

### Convolution Operation
- **Definition**: Core operation in CNNs where a filter (small matrix) slides over the input image, performing element-wise multiplication and summation to detect patterns like edges or textures.
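- **How it works**: At each position, the filter is multiplied element-wise with the input patch beneath it and the products are summed into one output value. (Deep learning libraries implement this as cross-correlation, i.e., without flipping the filter.)
- **Purpose**: Extracts local features with far fewer parameters than a fully connected layer, because the same filter weights are reused at every position.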
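
As a rough illustration of the sliding-window arithmetic, here is a minimal pure-Python sketch. The function name `conv2d_valid`, the toy image, and the vertical-edge filter are assumptions made for this example; it uses no padding and stride 1.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Toy 4x5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]

# Hypothetical vertical-edge filter: responds where values rise left to right.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The flat dark region gives 0, while positions overlapping the dark-to-bright edge give a strong response.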

### Filters/Kernels
- **Definition**: Small, learnable matrices (e.g., 3x3) that scan the input to detect specific features (e.g., horizontal edges).
- **Key points**: Multiple filters per layer create multiple feature maps. Initialized randomly, updated during training via backpropagation.
- **Size impact**: Smaller filters capture fine details; larger ones see broader patterns.

### Feature Maps
- **Definition**: Output grids from applying filters to the input, representing detected features at different spatial locations.
- **How formed**: One feature map per filter. Stacked together for the next layer's input.
- **Role**: Build hierarchical representations (low-level edges → high-level objects).

### Padding (Valid and Same)
- **Definition**: Adds extra pixels (usually zeros) around input borders to control output size.
- **Valid Padding**: No padding; output shrinks (e.g., 5x5 input with 3x3 filter → 3x3 output).
- **Same Padding**: Adds enough padding to keep the output the same size as the input at stride 1 (e.g., a 1-pixel border for a 3x3 filter; in general (k - 1)/2 pixels per side for an odd filter size k).
- **Why use**: Prevents information loss at edges; "Same" maintains spatial dimensions.
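
A small sketch of zero padding, assuming a 2D image stored as a list of rows. The helper names `zero_pad` and `same_padding` are hypothetical, and `same_padding` only covers odd filter sizes at stride 1.

```python
def zero_pad(image, pad):
    """Surround a 2D image (list of rows) with `pad` rows/columns of zeros."""
    width = len(image[0]) + 2 * pad
    top = [[0] * width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in image]
    bottom = [[0] * width for _ in range(pad)]
    return top + middle + bottom

def same_padding(kernel_size):
    """Per-side padding that preserves size at stride 1 (odd kernel sizes only)."""
    return (kernel_size - 1) // 2

padded = zero_pad([[1, 2], [3, 4]], same_padding(3))
for row in padded:
    print(row)
# [0, 0, 0, 0]
# [0, 1, 2, 0]
# [0, 3, 4, 0]
# [0, 0, 0, 0]
```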

### Strides
- **Definition**: Step size by which the filter slides over the input (e.g., stride=1: every pixel; stride=2: every other).
- **Impact**: Larger strides reduce output size and computation (downsampling).
- **Formula**: Output size = floor((Input size - Filter size + 2*Padding) / Stride) + 1, computed separately for height and width.
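
The formula can be checked with a few lines of Python. This is a sketch for one spatial dimension; the function name `conv_output_size` is an assumption, and integer division is used to round down when the stride does not divide evenly.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Output length along one dimension; integer division rounds down."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3))                          # valid padding: 3
print(conv_output_size(5, 3, padding=1))               # same padding: 5
print(conv_output_size(7, 3, stride=2))                # stride 2: 3
print(conv_output_size(224, 7, padding=3, stride=2))   # 112
```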

### Pooling (Max/Average)
- **Definition**: Downsampling operation to reduce spatial dimensions and computation while retaining key info.
- **Max Pooling**: Takes maximum value in a window (e.g., 2x2) → highlights strongest features.
- **Average Pooling**: Takes average value in window → smooths features, less common for classification.
- **Common use**: After convolution; e.g., 2x2 pooling with stride=2 halves dimensions.
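
To make the two pooling modes concrete, here is a minimal sketch on a 4x4 feature map. The function name `pool2d` and the sample values are assumptions for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or average pooling over size x size windows."""
    output = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        output.append(row)
    return output

feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 2],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[7, 2], [2, 8]]
print(avg_pooled)  # [[4.0, 1.0], [1.0, 5.0]]
```

Both results are 2x2: a 2x2 window with stride 2 halves each spatial dimension.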

### Convolutions Over Volumes
- **Definition**: Applying convolution to multi-channel inputs (e.g., RGB images with 3 channels).
- **How it works**: Filters have same depth as input channels (e.g., 3D filter for RGB). Sum across channels for each position.
- **Output**: Feature map depth equals number of filters, not input channels.
- **Extension**: For videos or 3D data, uses 3D convolutions (height, width, time/depth).
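
A sketch of convolution over a multi-channel input, assuming a channels-first layout (`[channels][height][width]`), no padding, and stride 1. The function name and the toy filters are hypothetical.

```python
def conv_volume(volume, filters):
    """volume: [C][H][W]; filters: [F][C][kh][kw]. Returns [F][out_h][out_w]."""
    channels = len(volume)
    height, width = len(volume[0]), len(volume[0][0])
    output = []
    for filt in filters:
        assert len(filt) == channels, "filter depth must match input channels"
        kh, kw = len(filt[0]), len(filt[0][0])
        fmap = []
        for i in range(height - kh + 1):
            row = []
            for j in range(width - kw + 1):
                total = 0
                # Sum over every channel and every kernel position.
                for c in range(channels):
                    for di in range(kh):
                        for dj in range(kw):
                            total += volume[c][i + di][j + dj] * filt[c][di][dj]
                row.append(total)
            fmap.append(row)
        output.append(fmap)
    return output

# 3-channel 3x3 input; channel c is filled with the value c + 1.
volume = [[[c + 1] * 3 for _ in range(3)] for c in range(3)]

all_ones = [[[1, 1], [1, 1]] for _ in range(3)]                  # looks at all channels
first_only = [[[1, 1], [1, 1]]] + [[[0, 0], [0, 0]]] * 2         # looks at channel 0 only

out = conv_volume(volume, [all_ones, first_only])
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
print(out[0][0][0], out[1][0][0])             # 24 4
```

Two filters yield an output of depth 2, regardless of the 3 input channels.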

## CNN Architectures and Evolution

### LeNet-5
- **Overview**: Early CNN (1998) for handwritten digit recognition (MNIST dataset).
- **Key features**: 7 layers (conv, pool, fully connected); used tanh activation.
- **Significance**: Pioneered CNNs for image tasks; simple and efficient for small images.

### AlexNet
- **Overview**: Breakthrough CNN (2012) that won the ImageNet challenge (ILSVRC 2012); 8 learned layers (5 conv + 3 fully connected).
- **Key features**: ReLU activation, dropout, data augmentation; used GPU training.
- **Impact on Computer Vision**: Revived interest in deep learning; showed CNNs scale to large datasets (1M+ images); inspired modern architectures.

### InceptionNet/GoogLeNet
- **Overview**: 2014 model with 22 layers; efficient via Inception modules.
- **Inception Modules**: Parallel convolutions (1x1, 3x3, 5x5) + pooling, concatenated for multi-scale feature extraction.
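- **1x1 Convolutions**: Reduce channel depth (bottleneck) before the expensive 3x3 and 5x5 convolutions, cutting computation with little loss of accuracy.
- **Significance**: Won the ILSVRC 2014 classification task with roughly 12x fewer parameters than AlexNet; focused on efficiency.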
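
The saving from a 1x1 bottleneck can be estimated by counting multiplications. In this sketch the 28x28x192 input, the 32 output filters, and the 16-channel bottleneck are illustrative values chosen for the arithmetic, not a claim about a specific GoogLeNet layer; same padding is assumed so the spatial size stays 28x28.

```python
def conv_multiplies(out_h, out_w, out_channels, kernel_size, in_channels):
    """Multiplications for one conv layer (one per weight per output position)."""
    return out_h * out_w * out_channels * kernel_size * kernel_size * in_channels

# Direct 5x5 convolution: 192 channels -> 32 channels.
direct = conv_multiplies(28, 28, 32, 5, 192)

# Bottleneck: 1x1 conv down to 16 channels, then 5x5 conv up to 32 channels.
reduce_step = conv_multiplies(28, 28, 16, 1, 192)
expand_step = conv_multiplies(28, 28, 32, 5, 16)
bottleneck = reduce_step + expand_step

print(direct)                          # 120422400
print(bottleneck)                      # 12443648
print(round(direct / bottleneck, 1))   # 9.7
```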

### ResNet
- **Overview**: Deep residual network (2015); up to 152 layers; won ILSVRC 2015.
- **Skip Connections**: Add input directly to output (shortcut) to ease training deep nets.
- **Degradation Problem**: Beyond a certain depth, plain nets get worse even on the training set, which points to optimization difficulty rather than overfitting.
- **Residual Blocks**: Core unit: conv layers + skip connection (output = F(x) + x); enables very deep models.
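
A minimal sketch of the `output = F(x) + x` idea. To keep it short, small dense layers on a vector stand in for the conv layers (an assumption of this example), and all names are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def linear(weights, vector):
    """Matrix-vector product; stands in for a conv layer in this sketch."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def residual_block(x, w1, w2):
    """Computes relu(F(x) + x) with F = linear -> relu -> linear."""
    fx = linear(w2, relu(linear(w1, x)))
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) = 0, so the block passes x through unchanged.
print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
```

Because the identity is the "do nothing" solution, an extra block only has to learn a correction to its input rather than the full mapping.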

### MobileNet
- **Overview**: Lightweight CNN (2017) for mobile/edge devices; focuses on speed.
- **Depth-Wise Separable Convolutions**: Splits a standard conv into a depth-wise conv (one filter per channel) and a point-wise conv (1x1 across channels); with 3x3 kernels this cuts computation by roughly 8-9x.
- **Computational Efficiency**: Fewer FLOPs; tunable via the width multiplier (thinner layers) and the resolution multiplier (smaller inputs) to trade accuracy for speed.
- **Use cases**: Real-time apps like object detection on phones.
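
The 8-9x figure follows from counting multiplications. In this sketch the layer sizes (3x3 kernel, 256 input and output channels, 14x14 feature map) are illustrative assumptions, and the function names are hypothetical.

```python
def standard_conv_cost(kernel, in_ch, out_ch, fmap):
    """Multiplications for a standard conv on a fmap x fmap output."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap

def separable_conv_cost(kernel, in_ch, out_ch, fmap):
    """Depth-wise (per channel) plus point-wise (1x1) multiplications."""
    depthwise = kernel * kernel * in_ch * fmap * fmap
    pointwise = in_ch * out_ch * fmap * fmap
    return depthwise + pointwise

standard = standard_conv_cost(3, 256, 256, 14)
separable = separable_conv_cost(3, 256, 256, 14)

# The ratio separable/standard simplifies to 1/out_ch + 1/kernel^2.
print(round(standard / separable, 2))  # 8.69
```

With many output channels the `1/out_ch` term is small, so the saving approaches `kernel^2`, i.e. 9x for 3x3 kernels.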

### EfficientNet
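- **Overview**: 2019 family of models (B0 to B7) that scales depth, width, and input resolution together.
- **Key features**: Compound scaling (one coefficient scales all three dimensions in fixed proportions); the B0 baseline was found by NAS (neural architecture search) and is built from MobileNetV2-style inverted residual blocks.
- **Advantages**: At publication, state-of-the-art ImageNet accuracy with fewer params/FLOPs; e.g., EfficientNet-B7 matched or exceeded much larger models.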
- **Evolution**: Builds on prior nets for better efficiency in resource-constrained settings.
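
Compound scaling can be sketched as three multipliers driven by a single coefficient `phi`. The constants below (1.2, 1.1, 1.15) are the values reported in the EfficientNet paper for the B0 baseline; the function name is hypothetical, and released variants round the resulting sizes, so treat this as an approximation rather than the exact model configs.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases

def compound_scale(phi):
    """Return (depth, width, resolution, approx FLOPs) multipliers for phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPs x{f:.2f}")
```

Since `ALPHA * BETA**2 * GAMMA**2` is close to 2, each increment of `phi` roughly doubles the FLOPs budget.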

## Advanced Training Techniques for CNNs

### Data Augmentation
- **Definition**: Artificially expands dataset by transforming images to improve generalization.
- **Techniques**:
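  - **Rotation**: Rotate images by random angles; small ranges (e.g., ±15°) suit natural images, while the full 0-360° range only makes sense when orientation is arbitrary (e.g., aerial or microscopy images).
  - **Flipping**: Horizontal/vertical flips for symmetry invariance; skip flips that change the label (e.g., digits or text).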
  - **Cropping**: Random crops/zooms to focus on parts and simulate variations.
  - **Color Jittering**: Adjust brightness, contrast, saturation, hue for lighting robustness.
- **Benefits**: Reduces overfitting; especially useful for small datasets.
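
Two of the techniques above, flipping and random cropping, can be sketched in pure Python on an image stored as a list of rows. The function names are hypothetical; in practice an image library would handle this, along with rotation and color jitter.

```python
import random

def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def vertical_flip(image):
    """Reverse the order of the rows."""
    return [list(row) for row in image[::-1]]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w patch at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]
print(vertical_flip(image))    # [[7, 8, 9], [4, 5, 6], [1, 2, 3]]

patch = random_crop(image, 2, 2, rng=random.Random(0))
print(len(patch), len(patch[0]))  # 2 2
```

Applying a fresh random transform each epoch means the network rarely sees exactly the same input twice.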

### Transfer Learning
- **Definition**: Use pre-trained model (e.g., on ImageNet) as starting point for new task.
- **Why**: Leverages learned features (low-level: edges; high-level: objects); faster training, better performance with less data.
- **Steps**: Load pre-trained weights; replace the final layer(s) with a head sized for the new classes; train the new head, optionally fine-tuning the rest.

### Fine-Tuning Strategies
- **Definition**: Adjust pre-trained model for specific task.
- **Strategies**:
  - Freeze early layers (feature extractors), train later ones.
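  - Gradually unfreeze deeper blocks, training them with a lower learning rate than the new head.
  - Keep learning rates small so updates do not destroy pre-trained weights; schedulers (e.g., reduce LR on plateau) lower them further when progress stalls.
- **When to use**: Small dataset similar to the pre-training data → freeze the backbone and train only the new head (feature extraction); large or very different dataset → fine-tune most or all layers.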
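
The reduce-on-plateau idea can be sketched without any framework. The class below is a hypothetical, simplified stand-in for what deep learning libraries provide; the starting learning rate, factor, and patience values are assumptions for illustration.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` when validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=1):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Start from a small learning rate, as is common when fine-tuning.
scheduler = ReduceOnPlateau(lr=1e-4, factor=0.5, patience=1)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.96, 0.97]]
print(history)
```

Here the loss stops improving after the second epoch, so the rate is halved once the patience of one epoch is exceeded (at the fourth value).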

### Batch Normalization in CNNs
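- **Definition**: Normalizes activations over each mini-batch to mean 0 and variance 1, then applies a learnable scale and shift. For conv layers, statistics are computed per channel across the batch and all spatial positions.
- **Placement**: Typically after conv/FC layers and before the activation; at inference, running averages of the training statistics replace the batch statistics.
- **Benefits**: Stabilizes training; allows higher LR; slight regularization. It was originally motivated as reducing internal covariate shift, though that explanation is debated.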
- **In CNNs**: Speeds up convergence; essential for deep nets like ResNet.
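
A sketch of the training-mode computation for conv feature maps, assuming a `[batch][channel][height][width]` layout. The function name is hypothetical, and running statistics for inference are left out to keep the example short.

```python
import math

def batch_norm_2d(batch, gamma, beta, eps=1e-5):
    """Normalize each channel over batch and spatial dims, then scale and shift."""
    n, c = len(batch), len(batch[0])
    h, w = len(batch[0][0]), len(batch[0][0][0])
    out = [[[[0.0] * w for _ in range(h)] for _ in range(c)] for _ in range(n)]
    for ch in range(c):
        values = [batch[b][ch][i][j]
                  for b in range(n) for i in range(h) for j in range(w)]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        for b in range(n):
            for i in range(h):
                for j in range(w):
                    x_hat = (batch[b][ch][i][j] - mean) * inv_std
                    out[b][ch][i][j] = gamma[ch] * x_hat + beta[ch]
    return out

# N=2 samples, C=2 channels, H=1, W=2. The channels have very different scales.
batch = [
    [[[1.0, 2.0]], [[10.0, 10.0]]],
    [[[3.0, 4.0]], [[30.0, 30.0]]],
]
normalized = batch_norm_2d(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(normalized[0][1], normalized[1][1])  # roughly [[-1, -1]] and [[1, 1]]
```

After normalization both channels sit on a comparable scale, whatever their original ranges were.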

### Dropout in CNNs
- **Definition**: Randomly drops (sets to zero) neurons during training, each with drop probability p (e.g., p=0.5).
- **Placement**: Typically after fully connected layers; sometimes after conv in CNNs.
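- **Benefits**: Reduces overfitting by training an implicit ensemble of thinned networks. Dropout is disabled at inference; with the common "inverted" variant, kept activations are scaled by 1/(1-p) during training so no rescaling is needed at test time.
- **In CNNs**: Less common in conv layers, where neighboring activations are strongly correlated and weight sharing already limits the parameter count; spatial dropout (dropping whole feature maps) is the usual alternative. Still useful in dense parts.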
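
A minimal sketch of inverted dropout on a flat list of activations. The function name and signature are assumptions for this example; `p` is the drop probability, and kept values are scaled by `1 / (1 - p)` so the expected activation is unchanged.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)  # inference: pass through untouched
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

activations = [1.0] * 1000
dropped = dropout(activations, p=0.5, rng=random.Random(0))

print(sorted(set(dropped)))          # [0.0, 2.0]
print(sum(dropped) / len(dropped))   # close to 1.0 on average
print(dropout([1.0, 2.0], training=False))  # [1.0, 2.0]
```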

### Degradation Problem
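- **Definition**: In deep plain networks, adding layers beyond a certain depth increases both training and test error.
- **Causes**: Optimization difficulty. It is not overfitting (training error rises too), and the ResNet paper argues it is unlikely to be vanishing gradients, since the plain nets were trained with batch norm.
- **Solutions**: ResNet's residual blocks (skip connections) make identity mappings easy to learn, so extra layers can do no worse than passing their input through.
- **Context**: Highlighted in the ResNet paper; batch norm and better init (e.g., He init) address the related vanishing/exploding gradient problem but do not remove degradation on their own.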