Instance segmentation of dishware in cluttered kitchen scenes using a fine-tuned Faster R-CNN with Feature Pyramid Network (FPN). Developed for APS360 (Applied Deep Learning) at the University of Toronto, Sep-Dec 2025.
Detecting and segmenting individual dishware items in cluttered kitchen environments is a challenging computer vision task with applications in robotics, smart kitchens, and assistive technology. This project fine-tunes a Faster R-CNN model with a ResNet-50 backbone to perform instance segmentation on plates, bowls, cups, and utensils. The model is benchmarked against a YOLOv8n baseline and outperforms it on both reported metrics: an AP50 of 0.270 versus 0.144, and an mAP (0.50:0.95) of 0.134 versus 0.121.
The model builds on Faster R-CNN with the following components:
- Backbone: ResNet-50 with a Feature Pyramid Network (FPN) for multi-scale feature extraction.
- Region Proposal Network (RPN): Custom anchor sizes of 16, 32, 64, 128, 256, and 512 pixels to accommodate dishware ranging from small utensils to large plates.
- RoI Align: Precise region-of-interest alignment, avoiding the quantization artifacts of RoI Pooling.
- Dual Prediction Heads: One branch for bounding box regression and classification, and a second branch for pixel-level mask generation.
The baseline comparison model is YOLOv8n (nano), a lightweight single-stage detector.
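To make the RPN anchor sizes listed above more concrete, here is a minimal sketch that enumerates the anchor box shapes they imply. The aspect ratios (0.5, 1.0, 2.0) are an assumption, since the project description does not state them, and the helper name `anchor_shapes` is hypothetical. Each anchor keeps an area of `size**2` while its height-to-width ratio varies.

```python
import math

# Anchor sizes from the RPN description above (pixels).
ANCHOR_SIZES = (16, 32, 64, 128, 256, 512)
# Assumed aspect ratios (height / width); not stated in the project description.
ASPECT_RATIOS = (0.5, 1.0, 2.0)


def anchor_shapes(sizes=ANCHOR_SIZES, ratios=ASPECT_RATIOS):
    """Return a (width, height) pair for every size/ratio combination.

    Each anchor keeps an area of roughly size**2 while its
    height/width ratio follows the requested aspect ratio.
    """
    shapes = []
    for size in sizes:
        for ratio in ratios:
            height = size * math.sqrt(ratio)
            width = size / math.sqrt(ratio)
            shapes.append((round(width, 2), round(height, 2)))
    return shapes


if __name__ == "__main__":
    for width, height in anchor_shapes():
        print(f"{width:7.2f} x {height:7.2f}")
```

Under these assumptions there are 18 anchor shapes in total, from wide 16-pixel anchors suited to utensils up to 512-pixel anchors for large plates seen up close.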
- Source: 245 images curated from COCO, LVIS, and Open Images datasets.
- Split: 70/15/15 stratified train/validation/test split.
- Augmentation: Random horizontal flip, random crop, and brightness jitter applied during training.
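The 70/15/15 stratified split could be sketched roughly as follows. This is an illustration rather than the notebook's actual code: it assumes each image has a single class label to stratify on (for example its dominant dishware category), and the function name `stratified_split` and the toy label counts in the usage example are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split image indices into train/val/test lists, class by class.

    `labels` is a hypothetical list holding one class label per image.
    Only the first two fractions are used directly; whatever remains
    in each class goes to the test set.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, val, test


if __name__ == "__main__":
    # Toy labels for 245 images; NOT the real class distribution.
    toy_labels = ["plate"] * 80 + ["bowl"] * 60 + ["cup"] * 60 + ["utensil"] * 45
    train, val, test = stratified_split(toy_labels)
    print(len(train), len(val), len(test))
```

Splitting within each class keeps rare categories such as utensils represented in all three subsets, which matters with only 245 images.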
| Model | AP50 | mAP (0.50:0.95) |
|---|---|---|
| Faster R-CNN (ours) | 0.270 | 0.134 |
| YOLOv8n (baseline) | 0.144 | 0.121 |
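For readers unfamiliar with the two metrics, the sketch below shows how they relate. AP50 counts a prediction as correct when its IoU with a ground-truth object is at least 0.50, while mAP (0.50:0.95) averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The example uses bounding boxes for simplicity (mask IoU is computed over pixels instead), and the function names are hypothetical, not taken from the notebook or from pycocotools.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """List the IoU thresholds at which this prediction counts as a match."""
    iou = box_iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if iou >= t]


if __name__ == "__main__":
    gt = (0, 0, 10, 10)
    pred = (2, 0, 12, 10)  # same size, shifted 2 pixels to the right
    print(box_iou(pred, gt))            # IoU = 80 / 120, about 0.667
    print(thresholds_passed(pred, gt))  # [0.5, 0.55, 0.6, 0.65]
```

A prediction like this one counts toward AP50 but is rejected at the stricter thresholds, which is why the mAP (0.50:0.95) column is lower than the AP50 column for both models.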
Key findings:
- The model performs best on plates, bowls, and cups.
- Small utensils remain challenging due to scale variation and occlusion.
- On unseen data, the model produces an average of 13.17 detections per image at a confidence threshold of 0.60, demonstrating reasonable generalization.
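The detections-per-image figure above depends directly on the confidence threshold. As a rough sketch of how such a number might be computed, the snippet below filters per-image confidence scores at 0.60 and averages the surviving counts. The function name and the scores are made up for illustration and are not project results.

```python
def mean_detections_per_image(scores_per_image, threshold=0.60):
    """Average number of detections per image with score >= threshold."""
    if not scores_per_image:
        return 0.0
    kept = [
        sum(score >= threshold for score in scores)
        for scores in scores_per_image
    ]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Made-up confidence scores for three images (illustrative only).
    scores = [
        [0.95, 0.72, 0.41],
        [0.88, 0.59],
        [0.61, 0.60, 0.30, 0.99],
    ]
    print(mean_detections_per_image(scores))  # 2.0
```

Raising the threshold lowers the average count and removes low-confidence detections, while lowering it does the opposite, so the count is best read alongside the AP metrics rather than on its own.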
- Upload `dishware_segmentation.ipynb` to Google Colab.
- Select a GPU runtime (Runtime > Change runtime type > GPU).
- Run all cells sequentially.
# Clone the repository
git clone https://github.com/BidoCodeHub/dishware-instance-segmentation.git
cd dishware-instance-segmentation
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install torch torchvision matplotlib pycocotools

Then open the notebook with Jupyter:

jupyter notebook dishware_segmentation.ipynb

Requirements:
- Python 3.8+
- PyTorch 1.12+
- torchvision 0.13+
- matplotlib
- pycocotools
The notebook is organized into the following sections:
- Environment Setup - Install and import required libraries.
- Dataset Preparation - Download, preprocess, and augment the dishware images.
- Model Definition - Configure Faster R-CNN with custom anchors and FPN.
- Training - Fine-tune the model on the training set with validation monitoring.
- Evaluation - Compute AP50 and mAP metrics on the test set.
- Inference and Visualization - Run predictions on new images and visualize segmentation masks.
- Baseline Comparison - Train and evaluate YOLOv8n under the same conditions.
This project was completed as part of APS360 (Applied Deep Learning) at the University of Toronto. We thank the course instructors and teaching assistants for their guidance.
The training data is sourced from the COCO, LVIS, and Open Images datasets.
This project is licensed under the MIT License. See LICENSE for details.