Deep Learning Project: U-Net for Pet Image Segmentation
Semantic segmentation is a fundamental problem in computer vision where the objective is to assign a class label to every pixel in an image. Unlike image classification, segmentation requires precise localization and boundary understanding.
In this project, the task is to segment images into three semantic classes:
- Pet
- Boundary
- Background
This problem is challenging due to:
- Fine object boundaries
- Variations in pet shapes, colors, and poses
- Class imbalance between foreground and background pixels
The project uses the Oxford-IIIT Pets Dataset, which contains images of cats and dogs along with pixel-wise segmentation masks.
Mask Classes (after preprocessing):
- 0 → Pet
- 1 → Boundary
- 2 → Background
The dataset is loaded using tensorflow_datasets (tfds) and split into training and testing sets.
To solve the segmentation problem, a pure U-Net architecture is implemented from scratch, without using any pretrained encoders.
U-Net is particularly well-suited for semantic segmentation because:
- It captures context through downsampling (encoder)
- It preserves spatial detail using skip connections
- It performs well even on limited datasets
The following preprocessing steps are applied:
- Images are resized to 128 × 128
- Pixel values are normalized to the range [0, 1]
- Masks are converted to integer labels suitable for sparse loss functions
- Random horizontal flipping is applied during training, which helps improve generalization and reduce overfitting
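The steps above can be sketched as a single preprocessing function. This is an eager-mode illustration, not the project's exact code: the function name, `IMG_SIZE` constant, and the assumption that raw masks are 1-indexed (so labels are shifted to start at zero) are all mine.

```python
import tensorflow as tf

IMG_SIZE = 128  # target resolution, per the preprocessing described above

def preprocess(image, mask, training=False):
    """Resize, normalize, and remap one (image, mask) pair.

    Sketch only: assumes raw masks use 1-indexed labels, as in the
    Oxford-IIIT Pets annotations, and remaps them to start at 0.
    """
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.cast(image, tf.float32) / 255.0  # normalize to [0, 1]
    # Nearest-neighbor resizing keeps mask values as valid class ids.
    mask = tf.image.resize(mask, (IMG_SIZE, IMG_SIZE), method="nearest")
    mask = tf.cast(mask, tf.int32) - 1          # e.g. {1, 2, 3} -> {0, 1, 2}
    if training:
        # Eager-mode sketch of augmentation; the same random flip must be
        # applied to both image and mask so they stay aligned.
        if tf.random.uniform(()) > 0.5:
            image = tf.image.flip_left_right(image)
            mask = tf.image.flip_left_right(mask)
    return image, mask
```

Inside a `tf.data` pipeline this would typically be applied via `dataset.map(...)`, with the flip rewritten to be graph-compatible.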
The encoder consists of repeated blocks of:
- Two 3×3 convolution layers
- Batch normalization
- ReLU activation
- 2×2 max pooling for downsampling
Feature depth increases progressively: 64 → 128 → 256 → 512
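A minimal Keras sketch of one encoder stage is below. The helper names (`conv_block`, `encoder_block`) and the `use_bias=False` choice (bias is redundant before batch normalization) are my assumptions, not necessarily the project's exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 3x3 convolutions, each followed by batch norm and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def encoder_block(x, filters):
    """One encoder stage: double convolution, then 2x2 max pooling.

    Returns both the pre-pooling tensor (kept as the skip connection)
    and the downsampled tensor passed to the next stage.
    """
    skip = conv_block(x, filters)
    down = layers.MaxPooling2D(2)(skip)
    return skip, down
```

Chaining four such stages with filters 64 → 128 → 256 → 512 reduces a 128 × 128 input to 8 × 8 before the bottleneck.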
The bottleneck captures high-level semantic features using:
- Two convolution layers with 1024 filters
The decoder restores spatial resolution using:
- Transposed convolutions for upsampling
- Skip connections from corresponding encoder layers
- Double convolution blocks after concatenation
This structure allows precise localization of object boundaries.
A 1×1 convolution with softmax activation produces a probability map for each class:
Output Shape = (H, W, 3)
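The decoder stage and output head described above can be sketched as follows. As before, helper names are illustrative assumptions; the kernel size of the transposed convolution (2, stride 2) is one common choice for exact 2× upsampling.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 3  # Pet, Boundary, Background

def conv_block(x, filters):
    """Double 3x3 convolution with batch norm and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def decoder_block(x, skip, filters):
    """Upsample with a transposed convolution, concatenate the matching
    encoder skip tensor, then apply a double convolution block."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    return conv_block(x, filters)

def output_head(x):
    """1x1 convolution with softmax -> (H, W, NUM_CLASSES) probabilities."""
    return layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)
```

Each decoder stage doubles the spatial resolution, so four stages restore the bottleneck's 8 × 8 features back to 128 × 128 before the output head.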
Sparse Categorical Cross-Entropy is used as the loss function:
- Suitable for multi-class pixel-wise classification
- Efficient as masks are stored as integer labels
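To make the "integer labels" point concrete, here is the per-pixel computation sparse categorical cross-entropy performs, written out in NumPy (the function name is mine; in the project this is handled by Keras' built-in loss):

```python
import numpy as np

def sparse_categorical_ce(probs, labels):
    """Mean per-pixel cross-entropy from integer labels.

    probs:  (H, W, C) softmax output of the network
    labels: (H, W) integer class ids

    The key efficiency: no one-hot encoding of the mask is ever
    materialized; each pixel just indexes its true class's probability.
    """
    h, w = labels.shape
    # For every pixel, pick the predicted probability of its true class.
    true_class_probs = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(-np.log(true_class_probs).mean())
```

A uniform 3-class prediction therefore yields a loss of exactly log 3 per pixel, a useful sanity check when training starts.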
Two evaluation metrics are tracked:
- Mean Intersection over Union (Mean IoU)
- Pixel Accuracy
Mean IoU is the primary metric for segmentation quality as it measures overlap between predicted and ground-truth regions.
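Both metrics are straightforward to define; a minimal NumPy sketch (function names are mine, and the handling of classes absent from both masks is one common convention):

```python
import numpy as np

def mean_iou(pred, true, num_classes=3):
    """Mean over classes of |pred ∩ true| / |pred ∪ true|."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (true == c))
        union = np.sum((pred == c) | (true == c))
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

def pixel_accuracy(pred, true):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float(np.mean(pred == true))
```

Note that pixel accuracy can look high even when a small class (e.g. Boundary) is segmented poorly, which is why Mean IoU is the primary metric here.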
Training configuration:
- Optimizer: Adam
- Batch size: 64
- Epochs: 10
The dataset pipeline uses:
- Caching
- Shuffling
- Batching
- Prefetching for performance optimization
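These four steps map directly onto `tf.data` calls; a possible arrangement (the function name and shuffle buffer size of 1000 are assumptions):

```python
import tensorflow as tf

def build_pipeline(dataset, batch_size=64, training=True):
    """Sketch of the input pipeline: cache -> shuffle -> batch -> prefetch.

    `dataset` is assumed to be a tf.data.Dataset of already-preprocessed
    (image, mask) pairs.
    """
    dataset = dataset.cache()              # keep decoded examples in memory
    if training:
        dataset = dataset.shuffle(1000)    # buffer size is an assumption
    dataset = dataset.batch(batch_size)
    # AUTOTUNE lets tf.data overlap data loading with training steps.
    return dataset.prefetch(tf.data.AUTOTUNE)
```

Caching before shuffling avoids re-decoding examples every epoch, while prefetching keeps the GPU fed during the next batch's preparation.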
The project includes visualization utilities that display:
- Input image
- Ground truth segmentation mask
- Predicted segmentation mask
These are shown side-by-side, allowing direct visual comparison of model performance.
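A side-by-side display of this kind is a few lines of Matplotlib; a minimal sketch (the function name and panel titles are mine):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripted use
import matplotlib.pyplot as plt
import numpy as np

def display(image, true_mask, pred_mask):
    """Show input image, ground-truth mask, and predicted mask in a row."""
    titles = ["Input Image", "True Mask", "Predicted Mask"]
    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, item, title in zip(axes, [image, true_mask, pred_mask], titles):
        ax.imshow(np.squeeze(item))  # drop the trailing channel dim of masks
        ax.set_title(title)
        ax.axis("off")
    return fig
```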
After training, the model is evaluated on the test set and reports:
- Final test loss
- Mean IoU score
- Pixel accuracy
These metrics provide an objective assessment of segmentation performance.
Technologies used:
- Python
- TensorFlow / Keras
- TensorFlow Datasets (TFDS)
- NumPy
- Matplotlib
This project demonstrates the effectiveness of a scratch-built U-Net for multi-class semantic segmentation. By combining careful preprocessing, a well-structured encoder–decoder architecture, and appropriate evaluation metrics, the model achieves accurate segmentation of pets and their boundaries. The project provides a strong foundation for understanding pixel-level deep learning tasks.