# Phase 6: CNN Model — Findings Report

**Summary:** Four CNN versions were trained on 3×64×64 rasterized room images with tabular side-inputs, improving test MAE from 17.90 (v1) to 8.07 (v4). The key finding is that each improvement came from strengthening the tabular branch, not from better image understanding. Adding apartment type as a feature (v4) gave the largest gain from any added feature (11.23→8.07), and CNN v4 now slightly beats LightGBM trained on the same features (MAE 8.24). The difference is small (0.17 MAE), however, so LightGBM remains the production model for Grasshopper deployment because it is simpler and has no PyTorch inference overhead.

## 1. Architecture Evolution

| Version | Architecture Changes | Test MAE | Δ MAE vs Baseline (14f) | W&B Run |
|---------|---------------------|----------|-------------|--------|
| **v1** | Raw concat (256+16+3=275), FC(275→128→1) | 17.90 | +6.88 | [`3wcevehy`](https://wandb.ai/infau/furnisher-surrogate/runs/3wcevehy) |
| **v2** | Image bottleneck 256→64, tabular FC 19→32, head 96→64→1 | 12.40 | +1.38 | [`qutd7leh`](https://wandb.ai/infau/furnisher-surrogate/runs/qutd7leh) |
| **v3** | +`n_vertices`, +`aspect_ratio`, tabular skip connection | 11.23 | +0.21 | [`ld6iz2h4`](https://wandb.ai/infau/furnisher-surrogate/runs/ld6iz2h4) |
| **v4** | +`apartment_type` embedding (4-dim for 7 types) | 8.07 | −2.95 | — |
| **Baseline (14f)** | LightGBM on 14 tabular features | 11.02 | — | [`3t4hiefb`](https://wandb.ai/infau/furnisher-surrogate/runs/3t4hiefb) |
| **Baseline (21f)** | LightGBM on 21 features (+apartment_type one-hot) | 8.24 | — | — |

Each version's improvement came from shifting weight toward the tabular branch, not from better image understanding:

- **v1→v2 (−5.50 MAE):** A bottleneck compressed image features from 256 to 64 dims, and a tabular FC layer (19→32 dims) was added. The model could no longer rely on the noisy image branch and was forced to extract more from tabular features.
- **v2→v3 (−1.17 MAE):** Added `n_vertices` and `aspect_ratio` as tabular scalars (features the baseline already had), plus a skip connection that lets tabular features bypass the image branch entirely.
- **v3→v4 (−3.16 MAE):** Added `apartment_type` as a 4-dim learned embedding. This was the largest gain from any feature addition (the larger v1→v2 drop came from an architecture change), confirming that apartment context carries substantial predictive signal, especially for Living room and Kitchen (see [apartment type EDA](03-03_apartment_type_eda.ipynb)).
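As a rough illustration of the v2 layout listed in the table (256→64 image bottleneck, 19→32 tabular FC, 96→64→1 head), a minimal PyTorch sketch might look like the following. The conv encoder, the class name `TwoBranchRegressor`, and the ReLU activations are assumptions; the report only specifies the layer widths, so this is not the project's actual model code.

```python
import torch
import torch.nn as nn


class TwoBranchRegressor(nn.Module):
    """Hypothetical v2-style fusion: image bottleneck + tabular FC + shared head."""

    def __init__(self, n_tabular: int = 19):
        super().__init__()
        # Assumed encoder: the report only states that the image branch
        # yields a 256-dim feature vector, not how it is computed.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),    # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),   # 32 -> 16
            nn.ReLU(),
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),  # 16 -> 8
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                                            # (batch, 256)
        )
        # Layer widths below follow the v2 row of the table.
        self.bottleneck = nn.Sequential(nn.Linear(256, 64), nn.ReLU())
        self.tabular = nn.Sequential(nn.Linear(n_tabular, 32), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(96, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, image: torch.Tensor, tabular: torch.Tensor) -> torch.Tensor:
        img = self.bottleneck(self.encoder(image))       # (batch, 64)
        tab = self.tabular(tabular)                      # (batch, 32)
        return self.head(torch.cat([img, tab], dim=1))   # (batch, 1)


model = TwoBranchRegressor()
scores = model(torch.zeros(2, 3, 64, 64), torch.zeros(2, 19))
print(scores.shape)
```

The v3 skip connection and the v4 `apartment_type` embedding are not shown here, since the report does not describe where they attach to the head.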

## 2. Key Finding

The CNN only approached baseline performance when the image branch was *weakened* and the tabular branch was *strengthened*. The spatial image information (room shape and door position rendered as pixels) adds **negligible predictive value** beyond what tabular features already capture.

Adding `apartment_type` improved both models substantially (CNN v3→v4: −28%, Baseline 14f→21f: −25%). CNN v4 now slightly beats LightGBM 21f (8.07 vs 8.24), but the 0.17 MAE difference is small. The improvement came entirely from new tabular signal (apartment type embedding), not from spatial features.
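The relative improvements quoted above follow directly from the test MAE values in the Section 1 table. A small sketch of the arithmetic (the helper name `relative_change` is illustrative):

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after` (negative means lower MAE)."""
    return (after - before) / before * 100.0


cnn = relative_change(11.23, 8.07)   # CNN v3 -> v4
lgbm = relative_change(11.02, 8.24)  # LightGBM 14f -> 21f

print(f"CNN v3->v4:        {cnn:.1f}%")   # -28.1%
print(f"LightGBM 14f->21f: {lgbm:.1f}%")  # -25.2%
```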

This is a meaningful result: it tells us that for this furnisher algorithm, room score is driven almost entirely by a handful of scalar features (area, vertex count, room type, apartment type, aspect ratio). Fine-grained spatial layout contributes minimally.

## 3. Per-Room-Type Comparison (v3 vs Baseline 14f)

| Room Type | n (test) | Baseline MAE | CNN v3 MAE | Delta | Better? |
|-----------|----------|-------------|-----------|-------|--------|
| Bedroom | 744 | 10.29 | 9.63 | −0.66 | CNN |
| Living room | 750 | 18.84 | 18.65 | −0.19 | CNN |
| Bathroom | 833 | 3.14 | 3.81 | +0.67 | Baseline |
| WC | 429 | 10.87 | 12.84 | +1.97 | Baseline |
| Kitchen | 833 | 16.89 | 16.67 | −0.22 | CNN |
| Children 1 | 494 | 6.61 | 6.56 | −0.05 | CNN |
| Children 2 | 286 | 7.48 | 8.22 | +0.74 | Baseline |
| Children 3 | 162 | 9.48 | 9.96 | +0.48 | Baseline |
| Children 4 | 63 | 8.97 | 9.62 | +0.65 | Baseline |

CNN v3 beats the baseline on Bedroom, Living room, Kitchen, and Children 1, the room types where spatial layout *might* add value. But the improvements are marginal (< 1 point), and the regressions, led by WC (+1.97) and Bathroom (+0.67), more than offset them: overall, v3 trails the baseline by 0.21 MAE.

**Overall v3**: MAE=11.23, RMSE=19.18, R²=0.80, Fail/Pass accuracy=0.87.
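For reference, the four metrics reported here could be computed along the following lines. This is a sketch under assumptions: the report does not state the Fail/Pass cutoff, so `pass_threshold` is a required argument, and the value 50 in the toy example is purely hypothetical, as are the toy scores.

```python
import math


def regression_metrics(y_true, y_pred, pass_threshold):
    """Return MAE, RMSE, R^2, and Fail/Pass agreement for two score lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # A prediction "agrees" when it falls on the same side of the cutoff.
    agree = sum(
        (t >= pass_threshold) == (p >= pass_threshold)
        for t, p in zip(y_true, y_pred)
    )
    return {"mae": mae, "rmse": rmse, "r2": r2, "pass_acc": agree / n}


# Toy scores, not project data; threshold of 50 is an assumed cutoff.
metrics = regression_metrics([0, 0, 100, 90], [10, 0, 80, 95], pass_threshold=50)
print(metrics)
```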

### Impact of apartment_type (v4 + Baseline 21f)

Adding apartment type dramatically improved the two hardest room types:

| Room Type | Baseline 14f MAE | Baseline 21f MAE | Change |
|-----------|------------------|------------------|--------|
| Kitchen | 16.89 | 11.14 | −34% |
| Living room | 18.84 | 8.39 | −55% |

This is consistent with the EDA finding that apartment type has a large effect on Living room (eta-sq=0.19) and medium effect on Kitchen (eta-sq=0.11). The furnisher assigns different furniture depending on apartment context, and this directly affects scores for these room types.
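The eta-squared values cited from the EDA are the share of score variance explained by group membership, i.e. the between-group sum of squares divided by the total sum of squares. A minimal sketch of that calculation, using made-up scores for two hypothetical apartment types rather than project data:

```python
def eta_squared(groups):
    """One-way ANOVA effect size: SS_between / SS_total."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical Living room scores grouped by apartment type (toy numbers).
toy_scores = {"type_a": [90.0, 100.0], "type_b": [40.0, 50.0]}
effect = eta_squared(toy_scores)
print(f"eta-squared = {effect:.3f}")
```

In this toy case almost all variance lies between the groups, so the value is close to 1; the EDA values of 0.19 and 0.11 indicate a much weaker but still notable group effect.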

## 4. Why Image Features Don't Add Value

Several factors explain why image features fail to add value:

1. **Score is driven by tabular features.** Area (r=+0.37), vertex count, and room type together explain most of the variance. These are all scalar features that tree ensembles handle natively.

2. **Room shape beyond vertex count is weakly predictive.** Whether a room is L-shaped or rectangular matters far less than whether it has 4 or 8 vertices. The image encodes shape detail that the score function doesn't strongly depend on.

3. **3×64×64 resolution may lose critical detail.** Door position and wall proportions at 64px resolution may not provide enough spatial precision for furniture placement reasoning.

4. **Bimodal score distribution favors tree models.** With 28% zeros and 42% scores ≥90, the prediction problem has sharp decision boundaries that gradient-boosted trees handle better than CNNs with MSE loss.

5. **Per-room normalization discards absolute size.** The rasterization normalizes each room to fill the 64×64 canvas, so the image doesn't encode absolute room size — the strongest predictor.
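Point 5 can be made concrete with a toy example: once each room is scaled to fill the canvas, two rooms with the same shape but very different areas produce identical pixel coordinates. The function below is an assumed, simplified stand-in for the project's rasterizer (aspect-preserving fit to the longer side), not its actual implementation.

```python
def normalize_to_canvas(vertices, size=64):
    """Scale a polygon so its longer side spans the full canvas (assumed scheme)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    min_x, min_y = min(xs), min(ys)
    extent = max(max(xs) - min_x, max(ys) - min_y)
    scale = (size - 1) / extent
    return [(round((x - min_x) * scale), round((y - min_y) * scale))
            for x, y in vertices]


small_room = [(0, 0), (3, 0), (3, 2), (0, 2)]  # 3 x 2 units, area 6
large_room = [(0, 0), (6, 0), (6, 4), (0, 4)]  # 6 x 4 units, area 24

small_px = normalize_to_canvas(small_room)
large_px = normalize_to_canvas(large_room)
print(small_px == large_px)  # True: the 4x area difference is gone
```

Under this scheme, absolute area can only reach the model through the tabular branch.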

## 5. Potential Improvements (Future Reference)

If spatial features are revisited in future work, these directions could help:

- **Higher resolution** (128×128 or 256×256): May preserve door-wall relationships better
- **Attention mechanisms**: Spatial attention on image features could focus on door/wall interfaces
- **Two-stage model**: Separate zero/nonzero classifier + regression on nonzero rooms
- **Multi-task learning**: Predict room type + score jointly to improve feature extraction
- **Ensemble**: LightGBM + CNN weighted average (per-room-type weights based on validation)
- **Different furnisher algorithms**: The current furnisher may be unusually tabular-friendly; other algorithms might benefit more from spatial features
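As one way to approach the ensemble idea, per-room-type blend weights could be chosen by a simple grid search on validation predictions. The sketch below uses toy numbers and a hypothetical helper `best_blend_weight`; it assumes both models' validation predictions are already available per room type.

```python
def best_blend_weight(y_true, pred_lgbm, pred_cnn, steps=10):
    """Grid-search the CNN weight w in [0, 1] that minimizes blended MAE."""
    best_w, best_mae = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        mae = sum(
            abs(w * cnn + (1.0 - w) * gbm - target)
            for target, gbm, cnn in zip(y_true, pred_lgbm, pred_cnn)
        ) / len(y_true)
        if mae < best_mae:
            best_w, best_mae = w, mae
    return best_w, best_mae


# Hypothetical validation predictions grouped by room type (toy numbers).
validation = {
    "WC": {"y": [0.0, 100.0], "lgbm": [0.0, 100.0], "cnn": [20.0, 70.0]},
    "Kitchen": {"y": [60.0, 80.0], "lgbm": [50.0, 95.0], "cnn": [60.0, 80.0]},
}
weights = {
    room: best_blend_weight(d["y"], d["lgbm"], d["cnn"])[0]
    for room, d in validation.items()
}
print(weights)
```

Weights fitted this way should be evaluated on the held-out test split, since tuning them on the same data used for reporting would overstate the gain.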

## 6. Model Artifacts

| Artifact | Path | Contents |
|----------|------|----------|
| CNN v4 checkpoint | `models/cnn_v4.pt` | v3 architecture + apt_type embedding (4-dim), normalization stats, test MAE=8.07 |
| CNN v3 checkpoint | `models/cnn_v3.pt` | `model_state_dict`, `config`, normalization stats, `epoch`, `val_mae`, `test_mae` |
| CNN v1 checkpoint | `models/cnn_v1.pt` | v1 architecture weights (kept for comparison) |
| Baseline model (21f) | `models/baseline_lgbm.joblib` | LightGBM production model with apartment_type, test MAE=8.24 |

### W&B Runs

| Run | ID | Link |
|-----|----|------|
| CNN v1 | `3wcevehy` | [wandb.ai](https://wandb.ai/infau/furnisher-surrogate/runs/3wcevehy) |
| CNN v2 | `qutd7leh` | [wandb.ai](https://wandb.ai/infau/furnisher-surrogate/runs/qutd7leh) |
| CNN v3 | `ld6iz2h4` | [wandb.ai](https://wandb.ai/infau/furnisher-surrogate/runs/ld6iz2h4) |
| Baseline (14f) | `3t4hiefb` | [wandb.ai](https://wandb.ai/infau/furnisher-surrogate/runs/3t4hiefb) |

## 7. Conclusion

**CNN v4 slightly beats LightGBM 21f (8.07 vs 8.24), but the difference is small.** LightGBM remains the production model for Grasshopper deployment — simpler, faster, no PyTorch inference overhead.

The pattern across all four CNN versions is clear: every improvement came from strengthening the tabular branch (bottleneck, skip connections, geometry features, apartment type), never from extracting better spatial information from images. The furnisher scoring function is dominated by scalar room properties and apartment context, not fine-grained spatial layout.

### Per-room-type: CNN v4 vs Baseline 21f

| Room Type | n | Baseline 21f | CNN v4 | Delta |
|-----------|---|-------------|--------|-------|
| Bedroom | 744 | 10.29 | **9.03** | −1.26 |
| Living room | 750 | 8.39 | **8.08** | −0.31 |
| Bathroom | 833 | **3.40** | 3.49 | +0.09 |
| WC | 429 | **10.70** | 13.24 | +2.54 |
| Kitchen | 833 | 11.14 | **9.87** | −1.27 |
| Children 1 | 494 | 6.51 | **6.49** | −0.02 |
| Children 2 | 286 | 7.15 | **7.11** | −0.04 |
| Children 3 | 162 | **8.86** | 10.57 | +1.71 |
| Children 4 | 63 | **8.01** | 8.55 | +0.54 |
| **Overall** | **4594** | **8.24** | **8.07** | **−0.17** |

CNN v4 wins on 5/9 room types (Bedroom, Living room, Kitchen, Children 1, Children 2). The WC regression (+2.54) is consistent across all CNN versions (v3: +1.97): the CNN struggles with binary-score room types where tree-based splitting excels.
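The Overall row can be cross-checked as the sample-weighted mean of the per-room-type MAEs. A short sketch using the values from the table above:

```python
# (room type, n, Baseline 21f MAE, CNN v4 MAE), copied from the table.
rows = [
    ("Bedroom", 744, 10.29, 9.03),
    ("Living room", 750, 8.39, 8.08),
    ("Bathroom", 833, 3.40, 3.49),
    ("WC", 429, 10.70, 13.24),
    ("Kitchen", 833, 11.14, 9.87),
    ("Children 1", 494, 6.51, 6.49),
    ("Children 2", 286, 7.15, 7.11),
    ("Children 3", 162, 8.86, 10.57),
    ("Children 4", 63, 8.01, 8.55),
]

total = sum(n for _, n, _, _ in rows)
overall_lgbm = sum(n * lgbm for _, n, lgbm, _ in rows) / total
overall_cnn = sum(n * cnn for _, n, _, cnn in rows) / total

print(total, round(overall_lgbm, 2), round(overall_cnn, 2))  # 4594 8.24 8.07
```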