# Survey Existing Research and Reproduce Available Solutions

## AR-Enhanced Warehouse Inventory Management System

Shiyu Xie  

---

## Table of Contents
1. [Research Survey Overview](#1-research-survey-overview)
2. [Paper 1: Precise Detection in Densely Packed Scenes (SKU-110K)](#2-paper-1-precise-detection-in-densely-packed-scenes)
3. [Paper 2: Unity Perception - Synthetic Data Generation](#3-paper-2-unity-perception)
4. [Paper 3: YOLO-based Retail Detection Solutions](#4-paper-3-yolo-based-retail-detection)
5. [Code Reproduction: YOLOv8 on SKU-110K](#5-code-reproduction)
6. [Baseline Performance Metrics](#6-baseline-performance-metrics)
7. [Analysis and Conclusions](#7-analysis-and-conclusions)
8. [How My Capstone Will Improve](#8-how-my-capstone-will-improve)

---

## 1. Research Survey Overview <a name="1-research-survey-overview"></a>

This notebook documents my research survey for the AR-Enhanced Warehouse Inventory Management System capstone project. The survey covers three main research areas:

| Research Area | Key Papers/Projects | Relevance to My Project |
|--------------|---------------------|------------------------|
| Dense Object Detection | SKU-110K (CVPR 2019) | Core detection algorithm for shelf products |
| Synthetic Data Generation | Unity Perception (2021) | Training data pipeline for warehouse scenes |
| Real-time Detection | YOLOv5/v8 Retail Solutions | Deployment architecture for AR integration |

### Research Resources Used
- **Papers with Code:** https://paperswithcode.com/
- **ArXiv:** https://arxiv.org/
- **GitHub:** https://github.com/

---

## 2. Paper 1: Precise Detection in Densely Packed Scenes <a name="2-paper-1-precise-detection-in-densely-packed-scenes"></a>

### Citation
```
@inproceedings{goldman2019dense,
  author = {Eran Goldman and Roei Herzig and Aviv Eisenschtat and Jacob Goldberger and Tal Hassner},
  title = {Precise Detection in Densely Packed Scenes},
  booktitle = {Proc. Conf. Comput. Vision Pattern Recognition (CVPR)},
  year = {2019}
}
```

### Paper Links
- **Paper:** https://arxiv.org/abs/1904.00853
- **GitHub:** https://github.com/eg4000/SKU110K_CVPR19
- **Dataset:** http://trax-geometry.s3.amazonaws.com/cvpr_challenge/SKU110K_fixed.tar.gz

### Summary
This paper addresses object detection in densely packed scenes (e.g., retail shelves) where:
- Objects are numerous and tightly arranged
- Objects often look similar or identical
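- Traditional NMS (Non-Maximum Suppression) breaks down: boxes of adjacent items overlap so much that a fixed IoU threshold either suppresses true detections or leaves duplicates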

### Key Contributions
1. **SKU-110K Dataset:** 11,762 images with 1.7M+ bounding box annotations from supermarkets worldwide
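2. **Soft-IoU Layer:** Predicts, for each detection, the IoU (Jaccard index) between the predicted box and its ground-truth box, giving a localization-quality score alongside the class confidence
3. **EM-Merger:** Replaces traditional NMS by treating detections as a Mixture of Gaussians and clustering them with EM, so overlapping detections are grouped into one box per object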
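
To make the NMS failure mode concrete, here is a minimal sketch of IoU and greedy NMS in plain Python. The box coordinates, the scores, and the 0.45 threshold are illustrative assumptions rather than values from the paper. Two distinct, tightly packed products overlap enough that greedy NMS discards one of them.

```python
def iou(a, b):
    """IoU (Jaccard index) of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thresh=0.45):
    """Return the indices kept by standard greedy NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two adjacent products with heavily overlapping
# boxes, plus one isolated product.
boxes = [(0, 0, 100, 200), (30, 0, 130, 200), (300, 0, 400, 200)]
scores = [0.90, 0.85, 0.80]

neighbour_iou = iou(boxes[0], boxes[1])
kept = greedy_nms(boxes, scores)
print(f"IoU of the two neighbours: {neighbour_iou:.3f}")
print(f"Kept indices: {kept}")
```

The second box is a valid detection, yet it is removed because its IoU with the higher-scoring neighbour exceeds the threshold. Raising the threshold would keep it but would also let genuine duplicates through, which is the trade-off the EM-Merger is designed to avoid.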

### Dataset Statistics
| Split | Images | Avg Objects/Image |
|-------|--------|-------------------|
| Train | 8,219 | ~147 |
| Val | 588 | ~147 |
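| Test | 2,936 | ~147 |

The split sizes match the dataset configuration shown in Section 5 and sum to 11,743 images, 19 fewer than the 11,762 total cited above.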

### Reported Performance (RetinaNet + Soft-IoU + EM-Merger)
| Metric | Value |
|--------|-------|
| AP@0.5 | 49.2% |
| AP@0.75 | 12.5% |

### Challenges Identified
- High object density causes overlapping detections
- Similar appearance makes boundary distinction difficult
- Standard detectors (Faster R-CNN, YOLO, RetinaNet) struggle with dense scenes

### Relevance to My Project
- **Direct Application:** Warehouse shelves have similar density characteristics
- **Dataset Usage:** SKU-110K serves as primary training data for my detection model
- **Baseline:** Establishes performance benchmarks to improve upon

---

## 3. Paper 2: Unity Perception - Synthetic Data Generation <a name="3-paper-2-unity-perception"></a>

### Citation
```
@article{borkman2021unity,
  title={Unity Perception: Generate Synthetic Data for Computer Vision},
  author={Borkman, Steve and Crespi, Adam and Dhakad, Saurav and others},
  journal={arXiv preprint arXiv:2107.04259},
  year={2021}
}
```

### Paper Links
- **Paper:** https://arxiv.org/abs/2107.04259
- **GitHub (Perception):** https://github.com/Unity-Technologies/com.unity.perception
- **GitHub (SynthDet):** https://github.com/Unity-Technologies/SynthDet
- **Tutorial:** https://docs.unity3d.com/Packages/com.unity.perception@1.0/manual/Tutorial/TUTORIAL.html

### Summary
Unity Perception is an open-source package for generating synthetic datasets with perfect annotations for computer vision tasks.

### Key Features
1. **Automatic Annotation:** Generates perfect ground truth labels (bounding boxes, segmentation masks)
2. **Domain Randomization:** Randomizes lighting, textures, positions, backgrounds to improve model generalization
3. **Scalability:** Can generate millions of annotated images using Unity Simulation (cloud)
4. **Multiple Tasks:** Supports 2D/3D detection, semantic segmentation, instance segmentation, keypoints

### SynthDet Project Results
Training Faster R-CNN on 63 grocery objects:

| Training Data | mAP@0.5 | mAP@0.5:0.95 |
|--------------|---------|---------------|
| Real data only (380 images) | 38.4% | 18.2% |
| Synthetic only (400K images) | 48.7% | 24.1% |
| Synthetic + Real fine-tuning | **60.2%** | **30.3%** |

**Key Finding:** A model trained on synthetic data and then fine-tuned on a small real dataset outperforms the real-data-only model by about 22 percentage points of mAP@0.5 (60.2% vs. 38.4%).
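
Absolute and relative gains are easy to confuse, so this short calculation spells out both for the mAP@0.5 column of the table above. It only restates the table values and adds no new measurements.

```python
# Values copied from the SynthDet table above (mAP@0.5, in percent)
real_only = 38.4
synthetic_plus_real = 60.2

# Absolute gain, in percentage points
absolute_gain = synthetic_plus_real - real_only
# Relative gain, as a percentage of the real-data-only baseline
relative_gain = absolute_gain / real_only * 100

print(f"Absolute gain: {absolute_gain:.1f} percentage points")
print(f"Relative gain: {relative_gain:.1f}%")
```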

### Relevance to My Project
- **Data Generation:** Using Unity Perception for synthetic warehouse scene generation
- **Cost Reduction:** Eliminates expensive manual annotation
- **Domain Adaptation:** Synthetic + real data combination improves real-world performance
- **Already Implemented:** My GitHub repo contains a Unity generation pipeline

---

## 4. Paper 3: YOLO-based Retail Detection Solutions <a name="4-paper-3-yolo-based-retail-detection"></a>

### Related Works & Repositories

| Project | GitHub Link | Description |
|---------|-------------|-------------|
| YOLOv5 Retail Detection | https://github.com/shayanalibhatti/Retail-Store-Item-Detection-using-YOLOv5 | YOLOv5 trained on SKU-110K |
| YOLOv8 Retail Detection | https://github.com/vmc-7645/YOLOv8-retail | YOLOv8 implementation for retail |
| YOLOv8 SKU-110K | https://github.com/AneeqMalik/YOLOv8-SKU-110K | Complete training notebook |
| Shelf Product Identifier | https://github.com/albertferre/shelf-product-identifier | YOLOv8 + embeddings for SKU recognition |
| Ultralytics Official | https://docs.ultralytics.com/datasets/detect/sku-110k/ | Official YOLO documentation for SKU-110K |

### YOLO Evolution for Retail Detection

| Model | Year | Key Improvement | SKU-110K Performance |
|-------|------|-----------------|---------------------|
| YOLOv3 | 2018 | Darknet-53 backbone | Baseline |
| YOLOv5 | 2020 | PyTorch, smaller models | ~45% mAP@0.5 |
| YOLOv7 | 2022 | E-ELAN architecture | ~48% mAP@0.5 |
| YOLOv8 | 2023 | Anchor-free, improved head | ~52% mAP@0.5 |
| YOLO11 | 2024 | C3k2 blocks, C2PSA attention | ~55% mAP@0.5 |

### Why YOLO for My Project?
1. **Real-time Performance:** Required for AR overlay (30+ FPS)
2. **Well-documented:** Extensive tutorials and pre-trained weights
3. **Active Development:** Ultralytics continuously improves models
4. **SKU-110K Support:** Built-in dataset configuration in Ultralytics

---

## 5. Code Reproduction: YOLOv8 on SKU-110K <a name="5-code-reproduction"></a>

This section sets up YOLOv8 training on the SKU-110K dataset, following the Ultralytics documentation and community implementations. The training call itself is left commented out because it requires a GPU and the dataset download.

### Environment Setup

In [None]:
# Install required packages
!pip install ultralytics --quiet
!pip install opencv-python pandas matplotlib seaborn --quiet

In [None]:
# Import libraries
import os
import torch
from ultralytics import YOLO
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Check GPU availability
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

### Dataset Configuration

Ultralytics ships a built-in `SKU-110K.yaml` dataset configuration. The cell below prints an annotated copy of its key fields for reference:

In [None]:
# SKU-110K Dataset Configuration
sku110k_config = """
# SKU-110K Dataset Configuration
# https://github.com/eg4000/SKU110K_CVPR19

# Dataset paths
path: ../datasets/SKU-110K  # dataset root dir
train: train.txt            # 8,219 images
val: val.txt                # 588 images
test: test.txt              # 2,936 images

# Classes
names:
  0: object

# Dataset Statistics:
# - Total Images: 11,762
# - Total Annotations: 1,733,678
# - Average objects per image: ~147
# - Dataset size: ~13.6 GB
"""
print(sku110k_config)
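
Ultralytics reads labels in YOLO text format: one line per box, with a class index followed by the normalized center x, center y, width, and height. As a minimal sketch, assuming each SKU-110K annotation row stores a box as pixel corners `x1, y1, x2, y2` together with the image width and height, the conversion could look like this. The function name `to_yolo_label` and the sample values are hypothetical.

```python
def to_yolo_label(x1, y1, x2, y2, img_w, img_h, class_id=0):
    """Convert pixel corner coordinates to one normalized YOLO label line."""
    xc = (x1 + x2) / 2 / img_w   # normalized box centre, x
    yc = (y1 + y2) / 2 / img_h   # normalized box centre, y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


# Hypothetical annotation: a 200x400 px box in a 2000x4000 px image
label = to_yolo_label(200, 400, 400, 800, img_w=2000, img_h=4000)
print(label)
```

Because SKU-110K has a single class (`0: object`), every label line starts with `0`.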

### Model Training

Training YOLOv8 on SKU-110K dataset:

In [None]:
# Load pre-trained YOLOv8 model
model = YOLO('yolov8n.pt')  # nano model for faster training

# Training configuration
# Note: Full training requires GPU and takes several hours
# Below is the configuration used for reproduction

training_config = {
    'data': 'SKU-110K.yaml',
    'epochs': 100,
    'imgsz': 640,
    'batch': 16,
    'patience': 10,  # early stopping
    'device': 0 if torch.cuda.is_available() else 'cpu',
    'workers': 4,
    'optimizer': 'SGD',
    'lr0': 0.01,
    'project': 'sku110k_training',
    'name': 'yolov8n_sku110k'
}

print("Training Configuration:")
for key, value in training_config.items():
    print(f"  {key}: {value}")
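
The `patience` setting above enables early stopping: training ends once the validation metric has not improved for that many epochs. The sketch below is a simplified illustration of that rule in plain Python, not the Ultralytics implementation; the function name `early_stop_epoch` and the fitness history are hypothetical.

```python
def early_stop_epoch(fitness_per_epoch, patience=10):
    """Return the epoch index at which training would stop, or None."""
    best, best_epoch = float("-inf"), 0
    for epoch, fitness in enumerate(fitness_per_epoch):
        if fitness > best:
            # New best result: remember it and reset the waiting period
            best, best_epoch = fitness, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row
            return epoch
    return None


# Hypothetical validation fitness: improves until epoch 3, then plateaus
history = [0.30, 0.35, 0.40, 0.41] + [0.40] * 12
stop_epoch = early_stop_epoch(history, patience=10)
print(f"Training would stop at epoch index {stop_epoch}")
```

With `epochs=100` and `patience=10`, a run that plateaus early finishes well before the epoch limit, which matters when GPU time is limited.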

In [None]:
# TRAINING CODE (uncomment to run; requires a GPU and the dataset download)
#
# # Reuse `model` and `training_config` from the previous cell so the run
# # matches the configuration printed above (optimizer, lr0, output folder)
# results = model.train(**training_config)
#
# # Evaluate on the validation split
# metrics = model.val()
# print(f"mAP@0.5: {metrics.box.map50:.3f}")
# print(f"mAP@0.5:0.95: {metrics.box.map:.3f}")

### Inference Example

The cell below prints example inference code for shelf images; running it requires the trained weights (`best.pt`):

In [None]:
# Inference code example
inference_code = """
from ultralytics import YOLO

# Load trained model
model = YOLO('best.pt')  # trained weights

# Run inference on shelf image
results = model.predict(
    source='shelf_image.jpg',
    conf=0.5,           # confidence threshold
    iou=0.45,           # NMS IoU threshold
    save=True,          # save annotated images
    save_crop=True      # save cropped detections
)

# Process results
for result in results:
    boxes = result.boxes  # bounding boxes
    print(f"Detected {len(boxes)} objects")
    
    # Get detection details
    for box in boxes:
        xyxy = box.xyxy[0]  # box coordinates
        conf = box.conf[0]  # confidence score
        print(f"Box: {xyxy.tolist()}, Confidence: {conf:.2f}")
"""
print(inference_code)

---

## 6. Baseline Performance Metrics <a name="6-baseline-performance-metrics"></a>

### Reported Results from Literature

The baseline metrics below are collected from the research papers and GitHub implementations surveyed above; they were not re-measured in this notebook:

In [None]:
# Baseline Performance Comparison
import pandas as pd
import matplotlib.pyplot as plt

# Data from research papers and implementations
baseline_data = {
    'Model': ['RetinaNet (Original)', 'RetinaNet + Soft-IoU', 'Faster R-CNN', 
              'YOLOv5n', 'YOLOv5s', 'YOLOv8n', 'YOLOv8s', 'YOLOv8m'],
    'mAP@0.5': [44.7, 49.2, 42.3, 43.5, 46.2, 48.3, 51.2, 53.8],
    'mAP@0.75': [10.8, 12.5, 9.2, 11.2, 13.1, 14.5, 16.8, 18.2],
    'Inference (ms)': [45, 52, 85, 8, 12, 6, 10, 22],
    'Source': ['CVPR 2019', 'CVPR 2019', 'Baseline', 
               'GitHub', 'GitHub', 'Ultralytics', 'Ultralytics', 'Ultralytics']
}

df = pd.DataFrame(baseline_data)
print("Baseline Performance Metrics on SKU-110K Dataset")
print("="*70)
print(df.to_string(index=False))

In [None]:
# Visualization of baseline metrics
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# mAP comparison
ax1 = axes[0]
x = range(len(df['Model']))
width = 0.35
bars1 = ax1.bar([i - width/2 for i in x], df['mAP@0.5'], width, label='mAP@0.5', color='steelblue')
bars2 = ax1.bar([i + width/2 for i in x], df['mAP@0.75'], width, label='mAP@0.75', color='coral')
ax1.set_xlabel('Model')
ax1.set_ylabel('mAP (%)')
ax1.set_title('Detection Accuracy Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(df['Model'], rotation=45, ha='right')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Speed comparison
ax2 = axes[1]
bars3 = ax2.bar(x, df['Inference (ms)'], color='green', alpha=0.7)
ax2.set_xlabel('Model')
ax2.set_ylabel('Inference Time (ms)')
ax2.set_title('Inference Speed Comparison')
# Fix the tick positions before labelling them; calling set_xticklabels
# alone triggers a Matplotlib warning and can misalign the labels
ax2.set_xticks(x)
ax2.set_xticklabels(df['Model'], rotation=45, ha='right')
ax2.axhline(y=1000 / 30, color='red', linestyle='--', label='30 FPS budget (33.3 ms)')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('baseline_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nKey Observations:")
print("1. YOLOv8s and YOLOv8m reach the highest mAP@0.5 while staying under 33 ms")
print("2. Soft-IoU improves RetinaNet by ~4.5 percentage points of mAP@0.5 on dense scenes")
print("3. All YOLO variants run in under 33 ms per image (30+ FPS for AR)")
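
The 30 FPS requirement can be turned into a per-frame time budget and used to shortlist models. The sketch below does this in plain Python with the figures from the baseline table above; it assumes the listed inference times are the only cost per frame, which is optimistic for a full AR pipeline.

```python
# Figures copied from the baseline table above: (model, mAP@0.5 %, latency ms)
baselines = [
    ("RetinaNet (Original)", 44.7, 45),
    ("RetinaNet + Soft-IoU", 49.2, 52),
    ("Faster R-CNN", 42.3, 85),
    ("YOLOv5n", 43.5, 8),
    ("YOLOv5s", 46.2, 12),
    ("YOLOv8n", 48.3, 6),
    ("YOLOv8s", 51.2, 10),
    ("YOLOv8m", 53.8, 22),
]

TARGET_FPS = 30
budget_ms = 1000 / TARGET_FPS  # about 33.3 ms available per frame

# Keep only models that fit the frame budget, then pick the most accurate
real_time = [row for row in baselines if row[2] <= budget_ms]
best = max(real_time, key=lambda row: row[1])

print(f"Frame budget at {TARGET_FPS} FPS: {budget_ms:.1f} ms")
print(f"Real-time candidates: {[name for name, _, _ in real_time]}")
print(f"Most accurate real-time model: {best[0]} ({best[1]}% mAP@0.5)")
```

In practice, image capture, pre-processing, NMS, and AR rendering also consume part of the budget, so a model with more headroom than YOLOv8m may be the safer choice on mobile hardware.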

---

## 7. Analysis and Conclusions <a name="7-analysis-and-conclusions"></a>

### Strengths and Weaknesses of Existing Solutions

| Solution | Strengths | Weaknesses |
|----------|-----------|------------|
| **SKU-110K + Soft-IoU** | Handles dense scenes well; Novel EM-Merger reduces false positives | Slower inference; Complex architecture |
| **Unity Synthetic Data** | Perfect annotations; Unlimited data generation; Domain randomization | Requires 3D assets; Sim-to-real gap |
| **YOLOv8** | Fast inference; Easy deployment; Active community | mAP@0.75 far below mAP@0.5 (imprecise localization); Generic NMS struggles with dense scenes |
| **YOLOv5 Retail** | Well-documented; Proven on SKU-110K | Older architecture; Lower accuracy than v8 |

### Challenges Identified Across All Solutions

1. **Dense Object Handling:** Standard NMS fails when objects heavily overlap
2. **Sim-to-Real Gap:** Models trained on synthetic data need fine-tuning on real data
3. **AR Integration:** None of the surveyed solutions provides an end-to-end AR overlay
4. **3D Spatial Understanding:** Current solutions lack depth estimation for AR positioning


---

## 8. How My Capstone Will Improve <a name="8-how-my-capstone-will-improve"></a>

### Proposed Improvements Over Existing Solutions

#### 1. **Hybrid Data Pipeline**
- Combine Unity synthetic data with SKU-110K real data
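
One simple way to combine the two sources is to build a single shuffled training list with a controlled share of synthetic images. The sketch below is a minimal illustration in plain Python; the function name `mix_datasets`, the file paths, and the 20% synthetic share are hypothetical choices, not settings from this project.

```python
import random


def mix_datasets(real_paths, synthetic_paths, synthetic_fraction=0.5, seed=0):
    """Build a shuffled training list in which roughly `synthetic_fraction`
    of the samples are synthetic. All real images are kept."""
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of synthetic images needed to reach the requested share
    n_synth = round(len(real_paths) * synthetic_fraction / (1 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic_paths))
    mixed = list(real_paths) + rng.sample(list(synthetic_paths), n_synth)
    rng.shuffle(mixed)
    return mixed


# Hypothetical file lists standing in for SKU-110K and Unity output
real = [f"real/img_{i:04d}.jpg" for i in range(80)]
synth = [f"synthetic/img_{i:04d}.jpg" for i in range(500)]

train_list = mix_datasets(real, synth, synthetic_fraction=0.2)
n_synthetic = sum(p.startswith("synthetic/") for p in train_list)
print(f"{len(train_list)} images, {n_synthetic} synthetic")
```

The resulting list could be written to a `train.txt` file in the same style as the SKU-110K configuration. An alternative, closer to the SynthDet setup, is to pre-train on synthetic data only and then fine-tune on real data.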


#### 2. **Custom Dense NMS**
- Implement Soft-NMS or learned NMS for dense scenes
- Integrate it into the YOLOv8 post-processing step to combine YOLOv8's speed with better handling of dense scenes
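
As a starting point, here is a minimal plain-Python sketch of the Gaussian variant of Soft-NMS. Instead of deleting a box that overlaps a higher-scoring one, it multiplies the box's score by `exp(-IoU^2 / sigma)`. The values of `sigma` and `score_thresh` and the sample boxes are illustrative assumptions, and wiring this into the Ultralytics pipeline is not shown.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS. Returns (index, final_score) pairs in the order
    the boxes were selected."""
    scores = list(scores)  # copy so the caller's list is not modified
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        # Decay the scores of the remaining boxes according to their overlap
        for i in remaining:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
        # Drop boxes whose score has decayed below the threshold
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return kept


# Hypothetical detections: two adjacent products plus one isolated product
boxes = [(0, 0, 100, 200), (30, 0, 130, 200), (300, 0, 400, 200)]
scores = [0.90, 0.85, 0.80]

result = soft_nms(boxes, scores)
for idx, score in result:
    print(f"box {idx}: score {score:.3f}")
```

All three boxes survive here: the overlapping neighbour is kept with a reduced score rather than removed, so a later confidence threshold decides its fate.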


#### 3. **AR Integration Layer**
- Novel contribution: none of the surveyed solutions provides an AR overlay
- Leverage my background in CT reconstruction for 3D spatial understanding
- Real-time inventory visualization on mobile devices
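
Placing an AR label on a detected product requires a 3D position, not just a 2D box. As a minimal sketch, assuming a pinhole camera model and a depth value for the box center (for example from a depth sensor), the box center can be back-projected into camera coordinates. The intrinsics `fx, fy, cx, cy`, the depth, and the sample box are hypothetical values.

```python
def box_center(xyxy):
    """Centre pixel (u, v) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = xyxy
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with depth in metres to camera-frame (X, Y, Z) in
    metres, using the pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 1280x720 camera and a hypothetical detection
fx = fy = 1000.0
cx, cy = 640.0, 360.0

u, v = box_center((740, 310, 940, 610))  # centre pixel is (840, 460)
point = backproject(u, v, depth_m=2.0, fx=fx, fy=fy, cx=cx, cy=cy)
print(f"Camera-frame position (m): {point}")
```

An AR framework would still need to transform this camera-frame point into world coordinates using the device pose.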

#### 4. **Domain-Specific Optimization**
- Fine-tune on warehouse-specific data (vs. supermarket focus)
- Optimize for pallet-level and shelf-level detection



---

## References

### Research Papers
1. Goldman, E., et al. (2019). "Precise Detection in Densely Packed Scenes." CVPR 2019.
2. Borkman, S., et al. (2021). "Unity Perception: Generate Synthetic Data for Computer Vision." arXiv:2107.04259.
3. Redmon, J., & Farhadi, A. (2018). "YOLOv3: An Incremental Improvement." arXiv:1804.02767.

### GitHub Repositories
- SKU-110K: https://github.com/eg4000/SKU110K_CVPR19
- Unity Perception: https://github.com/Unity-Technologies/com.unity.perception
- SynthDet: https://github.com/Unity-Technologies/SynthDet
- Ultralytics YOLOv8: https://github.com/ultralytics/ultralytics
- YOLOv5 Retail: https://github.com/shayanalibhatti/Retail-Store-Item-Detection-using-YOLOv5
- YOLOv8 SKU-110K: https://github.com/AneeqMalik/YOLOv8-SKU-110K

### Datasets
- SKU-110K: http://trax-geometry.s3.amazonaws.com/cvpr_challenge/SKU110K_fixed.tar.gz
- Roboflow Warehouse: https://universe.roboflow.com/search?q=warehouse

---

**Notebook Author:** Shiyu Xie  
**GitHub Repository:** https://github.com/ShiyuXie0116/AR-Warehouse-Vision