# Object Detection Architecture Modifications Report

**Student:** Darshan  
**Dataset:** Waymo COCO (10,000 images)  
**Base Model:** Faster R-CNN with ResNet152 Backbone  

---

## 1. Baseline Model

### Architecture Overview
- **Backbone:** ResNet152 (pretrained on ImageNet)
- **Neck:** Feature Pyramid Network (FPN)
- **Head:** Faster R-CNN with RPN
- **Classes:** 4 (Vehicle, Pedestrian, Cyclist, Sign)
- **Parameters:** 76M total, 17.8M trainable

### Training Configuration
```python
BASELINE_CONFIG = {
    'batch_size': 32,
    'epochs': 30,
    'lr': 0.01,
    'lr_schedule': 'MultiStepLR (drop at epoch 20)',
    'optimizer': 'SGD (momentum=0.9)',
    'train_images': 9000,
    'val_images': 1000
}
```
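The configuration above maps onto PyTorch's optimizer and scheduler classes roughly as follows. This is a minimal sketch, not the project's training loop: the decay factor `gamma=0.1` and the helper name `build_optimizer_and_scheduler` are assumptions, since the config only states that the learning rate drops at epoch 20.

```python
import torch
import torch.nn as nn


def build_optimizer_and_scheduler(model, lr=0.01, momentum=0.9,
                                  milestones=(20,), gamma=0.1):
    """SGD with momentum plus a step drop at the given epochs.

    gamma=0.1 is an assumed decay factor; the report does not state it.
    """
    # Only parameters that are not frozen are handed to the optimizer
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=lr, momentum=momentum)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=list(milestones), gamma=gamma)
    return optimizer, scheduler


# Tiny stand-in model so the sketch runs without the detector
model = nn.Linear(4, 2)
optimizer, scheduler = build_optimizer_and_scheduler(model)

lrs = []
for epoch in range(30):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()   # a real loop would iterate over batches here
    scheduler.step()   # called once per epoch
```

Under these assumptions, epochs 0 to 19 run at 0.01 and epochs 20 to 29 at 0.001.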

## 2. Proposed Architecture Modifications

### Modification 1: CBAM (Convolutional Block Attention Module)

**Motivation:**  
A standard convolutional block has no explicit mechanism for reweighting its output channels or spatial locations. CBAM adds two lightweight attention steps that focus on:
- **Channel Attention:** Which feature channels are most important?
- **Spatial Attention:** Which spatial locations contain objects?

**Implementation:**  
Insert CBAM blocks after ResNet bottleneck layers in layer3 and layer4.

```python
import torch
import torch.nn as nn


class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: one MLP shared by the avg- and max-pooled descriptors
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels)
        )
        # Spatial attention: 7x7 conv over the stacked channel-wise mean and max maps
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention: (B, C, H, W) -> per-channel weights of shape (B, C, 1, 1)
        avg_out = self.fc(self.avg_pool(x).view(x.size(0), -1))
        max_out = self.fc(self.max_pool(x).view(x.size(0), -1))
        channel_att = torch.sigmoid(avg_out + max_out).unsqueeze(2).unsqueeze(3)
        x = x * channel_att

        # Spatial attention: per-location weights of shape (B, 1, H, W)
        avg_out = torch.mean(x, dim=1, keepdim=True)
        max_out, _ = torch.max(x, dim=1, keepdim=True)
        spatial_att = torch.sigmoid(self.conv(torch.cat([avg_out, max_out], dim=1)))
        x = x * spatial_att

        # Output has the same shape as the input
        return x
```

**Expected Improvement:** +2-4 mAP points (absolute)  
**Computational Overhead:** ~5% increase in training time
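One way to place an attention module after every block of a backbone stage is to interleave the two inside a new `nn.Sequential`. The sketch below is an assumption about how the insertion could be done, not the repository's code; `add_attention_after_blocks` is a hypothetical helper, the toy stage stands in for `layer3`/`layer4`, and `nn.Identity` stands in for the `CBAM` class so the snippet runs on its own.

```python
import torch
import torch.nn as nn


def add_attention_after_blocks(stage, make_attention):
    """Return a new nn.Sequential in which each block of `stage`
    is followed by a freshly constructed attention module."""
    layers = []
    for block in stage.children():
        layers.append(block)
        layers.append(make_attention())
    return nn.Sequential(*layers)


# Toy stand-in for a ResNet stage (two shape-preserving blocks)
toy_stage = nn.Sequential(
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
)

# nn.Identity is a placeholder; the report's module would be CBAM(channels)
wrapped = add_attention_after_blocks(toy_stage, lambda: nn.Identity())
out = wrapped(torch.randn(2, 8, 16, 16))
```

With the `CBAM` class above, the factory would be `lambda: CBAM(1024)` for `layer3` and `lambda: CBAM(2048)` for `layer4`, since those are the bottleneck output widths in ResNet152. Because wrapping renumbers the entries of the `nn.Sequential`, pretrained weights should be loaded into the backbone before the stage is wrapped.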

### Modification 2: Enhanced FPN with Bottom-Up Path Augmentation (PANet-style)

**Motivation:**  
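Standard FPN fuses features only along a top-down pathway with lateral connections. Fine-grained low-level features therefore reach the top pyramid levels only after passing through the entire backbone, which is a long path.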

**Implementation:**  
Add a bottom-up augmentation path (N2 to N5) on top of the FPN outputs to shorten the information path from low-level to high-level features.

```
Standard FPN:           Enhanced FPN:
C5 → P5                 C5 → P5 → N5
 ↑    ↓                  ↑    ↓    ↑
C4 → P4                 C4 → P4 → N4
 ↑    ↓                  ↑    ↓    ↑
C3 → P3                 C3 → P3 → N3
 ↑    ↓                  ↑    ↓    ↑
C2 → P2                 C2 → P2 → N2
```

**Expected Improvement:** +3-5 mAP points (absolute), especially for small objects  
**Computational Overhead:** ~10% increase in training time
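The bottom-up path can be sketched as a small module that consumes the FPN outputs. This follows the PANet recipe (N2 = P2; each higher level is a stride-2 3x3 convolution of the previous N level, added to the matching P level, then refined by another 3x3 convolution). The class name `BottomUpPath` and the default channel width are assumptions, and the repository's version may differ.

```python
import torch
import torch.nn as nn


class BottomUpPath(nn.Module):
    """PANet-style bottom-up augmentation over FPN outputs [P2, P3, P4, P5]."""

    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        # Stride-2 conv halves the resolution of N_i to match P_{i+1}
        self.downsample = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, stride=2, padding=1)
             for _ in range(num_levels - 1)])
        # 3x3 conv refines the fused map to produce N_{i+1}
        self.fuse = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1)
             for _ in range(num_levels - 1)])
        self.relu = nn.ReLU()

    def forward(self, pyramid):
        # pyramid is ordered finest first: [P2, P3, P4, P5]
        outputs = [pyramid[0]]  # N2 is P2 unchanged
        for i in range(len(pyramid) - 1):
            down = self.relu(self.downsample[i](outputs[-1]))
            fused = self.relu(self.fuse[i](down + pyramid[i + 1]))
            outputs.append(fused)
        return outputs  # [N2, N3, N4, N5], same shapes as the inputs


# Shape check with a small channel width and power-of-two feature maps
path = BottomUpPath(channels=16)
feats = [torch.randn(1, 16, s, s) for s in (64, 32, 16, 8)]
outs = path(feats)
```

The element-wise addition requires each downsampled map to match the next P level exactly, which holds here because every level is half the size of the one below it.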

### Modification 3: Multi-Scale Training

**Motivation:**  
Objects appear at different scales in autonomous driving scenes. Training with multiple scales improves scale invariance.

**Implementation:**  
Randomly sample input resolution from [640, 720, 800, 960] during training.

**Expected Improvement:** +2-3 mAP points (absolute)  
**Computational Overhead:** Minimal (same as baseline)
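The sampling step might look like the sketch below. It assumes the listed values are target lengths for the shorter image side and that the aspect ratio is preserved; the report does not say which convention is used. `sample_resized_shape` is a hypothetical helper, and the 1280x1920 input is only an example size.

```python
import random

SCALES = [640, 720, 800, 960]


def sample_resized_shape(height, width, scales=SCALES, rng=random):
    """Pick a target shorter-side length at random and return the
    aspect-preserving (height, width) after resizing."""
    target = rng.choice(scales)
    factor = target / min(height, width)
    return round(height * factor), round(width * factor)


# Example: one draw for a 1280x1920 image, seeded for repeatability
rng = random.Random(0)
new_h, new_w = sample_resized_shape(1280, 1920, rng=rng)
```

Bounding boxes have to be scaled by the same factor as the image, which a resize transform in the data pipeline normally handles.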

## 3. Experimental Setup

### Experiments to Run

| Exp ID | Model | Description | Expected mAP@0.5 |
|--------|-------|-------------|------------------|
| **E0** | Baseline | ResNet152 + Faster R-CNN | 0.40-0.50 |
| **E1** | + CBAM | Add attention to backbone | 0.43-0.53 |
| **E2** | + Enhanced FPN | Add bottom-up augmentation | 0.44-0.55 |
| **E3** | + CBAM + Enhanced FPN | Combined modifications | 0.46-0.57 |
| **E4** | + All + Multi-scale | Final optimized model | 0.48-0.59 |

### Training Commands

```bash
# Flags for the training script (DeepDataMiningLearning/detection/mytrain.py)

# E0: Baseline (already training)
--expname 'baseline_resnet152' --model 'customrcnn_resnet152'

# E1: Baseline + CBAM
--expname 'resnet152_cbam' --model 'customrcnn_resnet152_cbam'

# E2: Baseline + Enhanced FPN
--expname 'resnet152_enhancedfpn' --model 'customrcnn_resnet152_enhancedfpn'

# E3: Combined
--expname 'resnet152_cbam_enhancedfpn' --model 'customrcnn_resnet152_cbam_enhancedfpn'
```

## 4. Evaluation Metrics

### Primary Metrics (COCO Format)
- **AP@[0.5:0.95]:** Average Precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05
- **AP@0.5:** Average Precision at IoU=0.5 (primary comparison metric)
- **AP@0.75:** Average Precision at IoU=0.75 (strict localization)
- **AR@100:** Average Recall given at most 100 detections per image
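To make the IoU thresholds concrete, the sketch below computes the IoU of two boxes and lists the ten thresholds used by AP@[0.5:0.95]. The helper names are illustrative and the boxes are assumed to be in `(x1, y1, x2, y2)` corner format. It is not a full AP computation.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(iou):
    """Thresholds at which a detection with this IoU counts as a match."""
    return [t for t in IOU_THRESHOLDS if iou >= t]


# Two 10x10 boxes offset by half a width overlap with IoU = 50 / 150
iou_half = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
```

A detection with IoU 0.78 counts as a true positive at the six thresholds up to 0.75 and as a miss at the remaining four, which is why AP@[0.5:0.95] rewards tight localization more than AP@0.5 does.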

### Per-Class Metrics
- Vehicle AP
- Pedestrian AP
- Cyclist AP
- Sign AP

### Scale-wise Metrics
- Small objects (area < 32² pixels)
- Medium objects (32² < area < 96² pixels)
- Large objects (area > 96² pixels)
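The size buckets above can be expressed as a small helper. `size_bucket` is a hypothetical name, and how an area exactly on a boundary is assigned is an assumption, since the list leaves those cases open.

```python
def size_bucket(area):
    """Assign an object area in pixels to a COCO-style size bucket."""
    if area < 32 ** 2:
        return 'small'
    if area < 96 ** 2:
        return 'medium'
    return 'large'


# Areas of 20x20, 50x50, and 100x100 pixel boxes
buckets = [size_bucket(side * side) for side in (20, 50, 100)]
```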

```python
# Import libraries for result visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
```

## 5. Results Collection

### Baseline Results (E0)

Training started: [DATE]  
Training completed: [DATE]  
Total training time: [TIME]

**Final Metrics:**
```
AP@[0.5:0.95] = [TO BE FILLED]
AP@0.5 = [TO BE FILLED]
AP@0.75 = [TO BE FILLED]
AR@100 = [TO BE FILLED]
```

**Per-Class AP@0.5:**
- Vehicle: [TO BE FILLED]
- Pedestrian: [TO BE FILLED]
- Cyclist: [TO BE FILLED]
- Sign: [TO BE FILLED]
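If the evaluation is run with pycocotools (an assumption, as the report only says the metrics are in COCO format), the placeholders above can be filled from the 12-element `stats` vector that `COCOeval.summarize()` populates for bounding boxes. `summarize_coco_stats` is a hypothetical helper, and the dummy input below is not a result.

```python
def summarize_coco_stats(stats):
    """Map the 12-element COCO bbox summary vector to named metrics."""
    if len(stats) != 12:
        raise ValueError('expected the 12-element COCO summary vector')
    return {
        'AP@[0.5:0.95]': float(stats[0]),
        'AP@0.5': float(stats[1]),
        'AP@0.75': float(stats[2]),
        'AP_small': float(stats[3]),
        'AP_medium': float(stats[4]),
        'AP_large': float(stats[5]),
        'AR@100': float(stats[8]),  # indices 6 and 7 are AR@1 and AR@10
    }


# Dummy vector used only to show the index mapping (not measured values)
dummy_stats = [i / 100 for i in range(12)]
named = summarize_coco_stats(dummy_stats)
```

Per-class AP is not part of this vector; it has to be computed separately, for example by evaluating one category at a time.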

```python
# Placeholder for results DataFrame
results = {
    'Model': ['Baseline', 'CBAM', 'Enhanced FPN', 'CBAM+FPN', 'All+Multiscale'],
    'AP@0.5': [0.0, 0.0, 0.0, 0.0, 0.0],  # TO BE FILLED
    'AP@0.75': [0.0, 0.0, 0.0, 0.0, 0.0],
    'AP@[0.5:0.95]': [0.0, 0.0, 0.0, 0.0, 0.0],
    'AR@100': [0.0, 0.0, 0.0, 0.0, 0.0],
    'Training Time (hrs)': [3.0, 3.2, 3.3, 3.5, 3.6]
}

df_results = pd.DataFrame(results)
print("Results Summary:")
print(df_results)
```

## 6. Performance Comparison Visualizations

### 6.1 Overall mAP Comparison

```python
# Bar plot: mAP comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# AP@0.5 comparison
df_results.plot(x='Model', y='AP@0.5', kind='bar', ax=ax1, color='steelblue', legend=False)
ax1.set_title('Average Precision @ IoU=0.5', fontsize=14, fontweight='bold')
ax1.set_ylabel('AP@0.5', fontsize=12)
ax1.set_xlabel('Model Architecture', fontsize=12)
ax1.set_ylim([0, 0.7])
ax1.grid(axis='y', alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# AP@[0.5:0.95] comparison
df_results.plot(x='Model', y='AP@[0.5:0.95]', kind='bar', ax=ax2, color='coral', legend=False)
ax2.set_title('Average Precision @ IoU=0.5:0.95', fontsize=14, fontweight='bold')
ax2.set_ylabel('AP@[0.5:0.95]', fontsize=12)
ax2.set_xlabel('Model Architecture', fontsize=12)
ax2.set_ylim([0, 0.5])
ax2.grid(axis='y', alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('mAP_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Saved: mAP_comparison.png")
```

### 6.2 Per-Class Performance Breakdown

```python
# Per-class AP data (to be filled after experiments)
per_class_data = {
    'Model': ['Baseline']*4 + ['CBAM']*4 + ['Enhanced FPN']*4 + ['CBAM+FPN']*4,
    'Class': ['Vehicle', 'Pedestrian', 'Cyclist', 'Sign']*4,
    'AP@0.5': [0.0]*16  # TO BE FILLED
}

df_perclass = pd.DataFrame(per_class_data)

# Grouped bar chart
fig, ax = plt.subplots(figsize=(12, 6))
df_pivot = df_perclass.pivot(index='Class', columns='Model', values='AP@0.5')
df_pivot.plot(kind='bar', ax=ax, width=0.8)
ax.set_title('Per-Class AP@0.5 Comparison', fontsize=14, fontweight='bold')
ax.set_ylabel('AP@0.5', fontsize=12)
ax.set_xlabel('Object Class', fontsize=12)
ax.legend(title='Model', bbox_to_anchor=(1.05, 1), loc='upper left')
ax.grid(axis='y', alpha=0.3)
ax.tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.savefig('per_class_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Saved: per_class_comparison.png")
```

### 6.3 Training Efficiency Analysis

```python
# Scatter plot: mAP vs Training Time
fig, ax = plt.subplots(figsize=(10, 6))

scatter = ax.scatter(df_results['Training Time (hrs)'], 
                     df_results['AP@0.5'], 
                     s=200, alpha=0.6, c=range(len(df_results)), 
                     cmap='viridis', edgecolors='black', linewidth=1.5)

# Add labels for each point
for idx, row in df_results.iterrows():
    ax.annotate(row['Model'], 
                (row['Training Time (hrs)'], row['AP@0.5']),
                xytext=(5, 5), textcoords='offset points',
                fontsize=10, fontweight='bold')

ax.set_xlabel('Training Time (hours)', fontsize=12)
ax.set_ylabel('AP@0.5', fontsize=12)
ax.set_title('Model Performance vs Training Time Trade-off', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('efficiency_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("✅ Saved: efficiency_comparison.png")
```

## 7. Analysis & Discussion

### Key Findings

1. **CBAM Attention Impact:**
   - [TO BE FILLED after experiments]
   - Most effective for: [class/scale]
   - Computational cost: acceptable/high

2. **Enhanced FPN Impact:**
   - [TO BE FILLED]
   - Small object detection: improved/similar
   - Best for: [scenario]

3. **Combined Modifications:**
   - [TO BE FILLED]
   - Synergistic effects: observed/not observed
   - Final recommendation: [model choice]

### Challenges & Solutions

- **Challenge 1:** [TO BE FILLED]
  - **Solution:** [TO BE FILLED]

- **Challenge 2:** [TO BE FILLED]
  - **Solution:** [TO BE FILLED]

### Future Improvements

1. Test with larger datasets (full Waymo/nuScenes)
2. Implement Cascade R-CNN for better localization
3. Try Deformable DETR for end-to-end training
4. Experiment with Vision Transformer backbones

## 8. Conclusion

[TO BE FILLED after completing all experiments]

Summary of improvements:
- Baseline mAP@0.5: [X.XX]
- Best modified model: [MODEL NAME]
- Best mAP@0.5: [X.XX]
- Improvement: +[X.X]% relative gain

**Selected model for deployment:** [MODEL NAME]  
**Justification:** [REASONING]
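Since the summary reports a relative gain, it helps to keep the absolute and relative figures apart when filling it in. The sketch below shows both computations; `map_gain` is a hypothetical helper and the inputs are made-up numbers for illustration, not experimental results.

```python
def map_gain(baseline, improved):
    """Return (absolute gain in mAP points, relative gain in percent)."""
    if baseline <= 0:
        raise ValueError('baseline mAP must be positive')
    absolute_points = (improved - baseline) * 100.0
    relative_percent = (improved - baseline) / baseline * 100.0
    return absolute_points, relative_percent


# Illustrative values only: 0.40 -> 0.44 is +4 points, or +10% relative
abs_points, rel_percent = map_gain(0.40, 0.44)
```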

---

## 9. References

1. Ren, S., et al. (2015). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS.
2. Lin, T. Y., et al. (2017). "Feature Pyramid Networks for Object Detection." CVPR.
3. Woo, S., et al. (2018). "CBAM: Convolutional Block Attention Module." ECCV.
4. Liu, S., et al. (2018). "Path Aggregation Network for Instance Segmentation." CVPR.
5. He, K., et al. (2016). "Deep Residual Learning for Image Recognition." CVPR.

---

## Source Code Repository

**GitHub:** https://github.com/Darshanhub/DeepDataMiningPersonal  
**Training Scripts:** `DeepDataMiningLearning/detection/mytrain.py`  
**Model Definitions:** `DeepDataMiningLearning/detection/models.py`  
**This Report:** `architecture_modifications.ipynb`