This repository contains the coursework project for COMP64301: Cognitive Robotics and Computer Vision. The project aims to design, implement, and evaluate computer vision algorithms, providing a comprehensive comparison between modern Deep Learning methods and traditional Computer Vision techniques for image classification.
The goal of this project is to benchmark two fundamentally different paradigms in computer vision across two distinct levels of difficulty—coarse-grained and fine-grained image classification.
We systematically compare a Convolutional Neural Network (CNN) via transfer learning against a classic Bag of Visual Words (BoW) pipeline using SIFT features and SVM.
Two datasets were selected to evaluate the models on classification tasks of differing granularity:
- Caltech 101 Subset (Coarse-grained): 10 diverse classes (airplanes, motorbikes, faces_easy, watch, leopards, bonsai, car_side, ketch, chandelier, hawksbill).
- Oxford-IIIT Pet Subset (Fine-grained): 10 visually similar cat and dog breeds (Abyssinian, Bengal, Birman, Bombay, British Shorthair, Egyptian Mau, Maine Coon, Persian, Ragdoll, Russian Blue).
Data Split:
Images were strictly divided into Training (70%), Validation (15%), and Testing (15%) sets. A fixed random seed (42) was used to ensure reproducible splits and reliable baseline comparisons.
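The split above can be sketched as follows. This is a minimal illustration of a seeded 70/15/15 partition, not the repository's actual `generate_splits.py` logic; the function name and file list are hypothetical.

```python
# Hypothetical sketch of a reproducible 70/15/15 split with a fixed seed.
import random

def split_items(items, seed=42, train=0.70, val=0.15):
    rng = random.Random(seed)   # fixed seed -> identical shuffle every run
    items = sorted(items)       # deterministic starting order
    rng.shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_items([f"img_{i:03d}.jpg" for i in range(100)])
```

Sorting before shuffling matters: it makes the split independent of filesystem ordering, so the same seed yields the same partition on any machine.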
We utilized a ResNet18 architecture, chosen for its excellent balance between computational efficiency and accuracy.
- Transfer Learning: Initialized with ImageNet pre-trained weights.
- Feature Extractor: All convolutional backbone layers were frozen.
- Classifier: Only the final Fully Connected (FC) layer was re-trained to map outputs to our 10 target classes.
- Preprocessing & Augmentation: Images resized to 224×224, normalized using standard ImageNet parameters, and randomly flipped horizontally during training to prevent overfitting.
- Training Setup: Adam optimizer with Cross-Entropy loss. Hyperparameter search over learning rates (0.001, 0.0005), batch sizes (32, 64), and epochs (15, 30). Trained on an NVIDIA RTX 4060 GPU.
We implemented a classic Bag of Visual Words (BoW) pipeline from scratch.
- Feature Extraction: Extracted local scale-invariant features using the SIFT algorithm.
- Visual Vocabulary: Clustered training descriptors using K-Means to build a visual dictionary, exploring vocabulary sizes of $K \in \{50, 100, 200, 500\}$.
- Image Encoding: Mapped descriptors to visual words to generate normalized histogram representations for each image.
- Classification: Trained Support Vector Machines (SVM), comparing Linear and Radial Basis Function (RBF) kernels, with regularization parameter $C \in \{0.1, 1.0, 10.0\}$. Executed entirely on CPU.
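The encoding and classification steps above can be sketched with scikit-learn. This is a minimal illustration, not the repository's `train_bow.py`: synthetic 128-dimensional descriptors stand in for real SIFT output (which would come from OpenCV's `cv2.SIFT_create`), and `encode` is our own helper name.

```python
# Sketch: K-Means visual vocabulary + normalized histogram encoding + SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(42)

def encode(descriptors, kmeans):
    """Map one image's descriptors to a normalized visual-word histogram."""
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

K = 50                                    # smallest vocabulary size explored
train_desc = rng.normal(size=(500, 128))  # stand-in for stacked SIFT descriptors
kmeans = KMeans(n_clusters=K, n_init=10, random_state=42).fit(train_desc)

# encode a few synthetic "images" (each a bag of descriptors) and fit an SVM
X = np.array([encode(rng.normal(size=(40, 128)), kmeans) for _ in range(20)])
y = np.arange(20) % 2
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
```

Normalizing each histogram makes images comparable regardless of how many keypoints SIFT detects in them, which is what lets a single SVM operate over the whole set.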
An extensive evaluation was conducted on the unseen Test set.
| Metric / Dimension | Deep Learning (ResNet18) | Traditional CV (BoW + SIFT) |
|---|---|---|
| Coarse Classification (Caltech101) | ~100% (near-perfect classification) | 85.3% (Best: K=500, RBF kernel) |
| Fine Classification (Oxford Pets) | ~92.7% | 41.7% |
| Computational Resources | Heavy; requires GPU (~166s training) | Lightweight; CPU sufficient (~30s training) |
| Interpretability | Low (Black box representations) | High (Explainable feature histograms) |
- Task Complexity: Traditional BoW pipelines perform reasonably well on coarse-grained tasks where distinct shapes and edge features (SIFT) are sufficient. However, they struggle significantly with fine-grained tasks (like pet breeds) where nuanced textures and hierarchical patterns are required.
- Feature Hierarchy: CNNs inherently learn deep, hierarchical spatial feature representations, driving their massive performance lead (92.7% vs 41.7%) on complex datasets.
- Trade-offs: While CNNs dominate in accuracy, they are opaque ("black-box") and hardware-intensive. Traditional methods remain highly interpretable, deterministic, and computationally lightweight.
.
├── code/
│ ├── generate_splits.py # Data splitting logic (70/15/15)
│ ├── generate_splits_and_folders.py # Prepares directory structures for PyTorch
│ ├── train_cnn.py # Script for training ResNet18
│ ├── train_bow.py # Script for SIFT extraction + K-Means + SVM
│ ├── test.py # Evaluation on test sets
│ └── final_plots.py # Generates loss curves and confusion matrices
├── data/
│ ├── raw/ # Original Caltech101 and Oxford-IIIT datasets
│ └── processed/ # Train/Val/Test subsets, kept strictly separate
├── docs/ # Project report & LaTeX source files
│ └── COMP64301_Assignment_v3.pdf # Final report PDF
└── results/
├── cnn_models/ # Saved PyTorch checkpoint weights (.pth)
├── figures/ # Confusion matrices and plots
├── cnn_results.csv # CNN test metrics
└── bow_results.csv # BoW test metrics