AI-Driven 2D Floorplan to BIM-Grade 3D Reconstruction
This project implements a research-oriented Scan-to-BIM pipeline that converts a 2D architectural floorplan image into a metric-scaled, editable 3D building model (GLB) using deep learning, computational geometry, and raster-to-vector reconstruction.
Unlike mesh-based visualizers, the system reconstructs architectural structure—walls, doors, topology, and scale—suitable for BIM-style downstream use.
A raster floorplan image (PNG / JPG)
A metric-scale 3D building model containing:
- Walls represented as solids
- Door openings with headers (lintels)
- Floor slab
- Consistent pixel-to-meter scaling
- Exportable `.glb` model
- Interactive browser-based visualization
The system focuses on geometric reconstruction, not surface visualization.
The methodology couples lightweight neural prediction with explicit geometric reconstruction.
- U-Net (ResNet-34 backbone): Customized Version 3. Trained for high-fidelity pixel-wise semantic segmentation of walls.
- YOLOv8 Medium: Trained to detect topological features (doors, sliding doors, windows) and architectural furniture (beds, toilets, sinks, cabinets).
- Custom Synthesized Raster Set: Trained the U-Net with on-the-fly OpenCV morphological augmentation (dilations, erosions) so that it learns to bridge gaps in wall strokes.
- FloorPlanCAD (Adapted): Subset of high-quality architectural CAD vectors exported to raster specifically for training the YOLO object detector.
- Raster-to-Vector topology: Generates 1D spatial skeletons from 2D pixel masses.
- RDP (Ramer-Douglas-Peucker): Curve simplification algorithm condensing pixel clouds into distinct vector segments.
- PCA (Principal Component Analysis): Used for orthogonal (total) least-squares line fitting on segmented pixel clouds.
- Hann-window blending: $H(n) = 0.5 \left(1 - \cos\left(\frac{2\pi n}{N-1}\right)\right)$. Used to merge segmented $512 \times 512$ tile inferences smoothly back into high-resolution maps.
- Bounding Centroids: Spatial formulas guaranteeing the Babylon.js FPS camera spawns strictly inside the interior topological boundary.
- AI/ML: `torch` (PyTorch inference & training), `ultralytics` (YOLO engine).
- Computer Vision: `cv2` (OpenCV morphologies), `skimage.morphology` (skeletonization).
- Geometry: `networkx` (graph traversal), `numpy` (vector math), `trimesh` (3D boolean construction & GLB export).
- Frontend: `streamlit` (UI sandboxing), `Babylon.js` (WebGL interactive gamified navigation layer).
Throughout research and scale-up, several approaches were abandoned:
- Abandoned Models (Zero-Shot SAM): Initially attempted to use Meta's Segment Anything Model (SAM) alongside the Roboflow API. Zero-shot foundation models completely failed to comprehend arbitrary 2D architecture logic (e.g., standard CAD door swing arcs) and were inherently too slow for single-pass inference.
- Abandoned Datasets (Raw FloorPlanCAD): Training the U-Net on raw CAD vectors yielded brittle inferences. We realized we had to intentionally degrade our training data using noise and artifact injections to build a robust model capable of parsing real-world, messy raster images.
- Abandoned Algorithms (2D Topological Carves - Scenario B/C): We initially attempted to mathematically splice wall vectors purely in 2D space before 3D generation. This proved extremely brittle, causing cascading intersection failures on non-orthogonal walls. We pivoted to a 3D generative projection approach where the Python engine safely emits bounding box matrices (`door_metadata`) and Babylon.js constructs the collision dynamics natively.
- Abandoned Algorithms (Hough Line Transform): Standard computer vision heuristic for floorplan mapping. Resulted in highly fragmented, overlapping, and useless lines across thicker wall segments. Abandoned in favor of our hybrid Skeletonization + PCA strategy.
Floorplan Image
│
▼
[ Phase 1 ] CNN-Based Wall Probability Estimation
│
▼
[ Phase 4 ] High-Resolution Mask Generation
│
▼
[ Phase 5 ] Raster-to-Vector Geometry Extraction
│
▼
[ Phase 3 ] Procedural BIM-Style Construction
│
▼
GLB 3D Building Model
Each phase is modular and independently extensible.
Status: Implemented
- Model: UNet with ResNet-34 encoder
- Task: Pixel-wise wall probability estimation
- Output: Floating-point probability map
Key characteristics:
- ImageNet-normalized inference
- Aspect-ratio–preserving resizing
- GPU-accelerated PyTorch pipeline
Status: Implemented
Addresses the resolution mismatch between fixed-input CNNs and large floorplans.
- Input images are divided into overlapping 512×512 tiles
- Each tile is independently processed by the CNN
- Outputs are merged using Hann-window weighted blending
This preserves wall continuity and global geometry while maintaining high spatial resolution.
This approach is conceptually aligned with SAHI-style inference.
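The blending step can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's code: the function names `hann_weight` and `blend_tiles`, the `(y, x, prob)` tile layout, and the `eps` guard are assumptions made for the example.

```python
import numpy as np


def hann_weight(tile: int) -> np.ndarray:
    """2D Hann window built as the outer product of two 1D windows H(n)."""
    w = np.hanning(tile)  # 0.5 * (1 - cos(2*pi*n / (N - 1)))
    return np.outer(w, w)


def blend_tiles(shape, tiles, tile=512, eps=1e-8):
    """Merge per-tile probability maps into one map using Hann weights.

    tiles: iterable of (y, x, prob), where prob is a (tile, tile) array whose
    top-left corner sits at (y, x) in the full-resolution image.
    """
    acc = np.zeros(shape, dtype=np.float64)   # weighted sum of predictions
    norm = np.zeros(shape, dtype=np.float64)  # sum of weights per pixel
    w = hann_weight(tile)
    for y, x, prob in tiles:
        acc[y:y + tile, x:x + tile] += prob * w
        norm[y:y + tile, x:x + tile] += w
    return acc / np.maximum(norm, eps)


# Two overlapping 8x8 tiles carrying the same constant prediction.
demo = blend_tiles(
    (8, 12),
    [(0, 0, np.full((8, 8), 0.7)), (0, 4, np.full((8, 8), 0.7))],
    tile=8,
)
```

Because the Hann window is zero at its end points, pixels on the outer border of the image receive no weight in this sketch. Padding the image before tiling is one possible way to handle that.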
Status: Implemented
Post-processing of CNN probability maps to ensure structural integrity before vectorization.
Operations include:
- Hard thresholding
- Connected-component filtering for noise removal
- Morphological closing to bridge small gaps
- Morphological opening for edge smoothing
The output is a clean, contiguous wall mask suitable for geometric processing.
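The four operations above can be illustrated with a dependency-free NumPy sketch. The project itself lists OpenCV for morphology, so this is a stand-in for illustration only: the 3x3 structuring element, 4-connectivity, the default `threshold` and `min_area` values, and all function names are assumptions.

```python
import numpy as np


def _neighbourhood(mask: np.ndarray, pad_value: bool) -> np.ndarray:
    """Stack the 3x3 neighbourhood of every pixel along a new first axis."""
    h, w = mask.shape
    p = np.pad(mask, 1, mode="constant", constant_values=pad_value)
    return np.stack([p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])


def dilate(mask: np.ndarray) -> np.ndarray:
    return _neighbourhood(mask, False).any(axis=0)


def erode(mask: np.ndarray) -> np.ndarray:
    return _neighbourhood(mask, True).all(axis=0)


def remove_small_components(mask: np.ndarray, min_area: int) -> np.ndarray:
    """Keep only 4-connected components with at least min_area pixels."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    out = np.zeros((h, w), dtype=bool)
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        stack, comp = [(sy, sx)], []
        seen[sy, sx] = True
        while stack:  # iterative flood fill
            y, x = stack.pop()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    stack.append((ny, nx))
        if len(comp) >= min_area:
            ys, xs = zip(*comp)
            out[list(ys), list(xs)] = True
    return out


def refine_mask(prob: np.ndarray, threshold: float = 0.5, min_area: int = 20) -> np.ndarray:
    mask = prob >= threshold                        # hard thresholding
    mask = remove_small_components(mask, min_area)  # noise removal
    mask = erode(dilate(mask))                      # closing: bridge small gaps
    mask = dilate(erode(mask))                      # opening: smooth edges
    return mask
```

Note that a 3x3 opening removes any structure thinner than three pixels, so the kernel size would need to be chosen against the thinnest wall expected at the working resolution.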
Status: Implemented
This phase performs the core Scan-to-BIM transformation by combining:
- Skeleton-based topology extraction
- Pixel-region–based geometric fitting
Pipeline steps:
- Skeletonization of the refined mask
- Conversion of the skeleton into a graph
- Tracing of wall paths between junctions
- Segmentation at geometric bends using RDP
- Extraction of wall pixel regions around each segment
- Least-squares (PCA) line fitting on pixel clouds
- Generation of CAD-grade wall axes
This produces straight, metric-accurate wall vectors while preserving connectivity.
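Two of the steps above, RDP segmentation and PCA line fitting, can be sketched as follows. This is an illustrative version under simple assumptions (ordered 2D points, one straight wall per pixel cloud); the names `rdp` and `fit_wall_axis` and the `epsilon` value are not taken from the repository.

```python
import numpy as np


def rdp(points: np.ndarray, epsilon: float) -> np.ndarray:
    """Ramer-Douglas-Peucker simplification of an ordered (N, 2) polyline."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    rel = points - start
    norm = np.hypot(chord[0], chord[1])
    if norm == 0:  # closed path: fall back to distance from the start point
        dists = np.hypot(rel[:, 0], rel[:, 1])
    else:          # perpendicular distance to the chord
        dists = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / norm
    idx = int(np.argmax(dists))
    if dists[idx] <= epsilon:
        return np.vstack([start, end])
    left = rdp(points[:idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return np.vstack([left[:-1], right])  # drop the duplicated split point


def fit_wall_axis(pixels: np.ndarray):
    """Fit a line to an (N, 2) pixel cloud by PCA and return its two end points."""
    centroid = pixels.mean(axis=0)
    centered = pixels - centroid
    # eigh returns eigenvalues in ascending order, so the last eigenvector
    # is the direction of largest variance (the wall axis).
    _, vecs = np.linalg.eigh(np.cov(centered.T))
    direction = vecs[:, -1]
    t = centered @ direction
    return centroid + t.min() * direction, centroid + t.max() * direction


# An L-shaped skeleton path collapses to its three corner points.
path = np.array([(x, 0) for x in range(11)] + [(10, y) for y in range(1, 11)], dtype=float)
corners = rdp(path, epsilon=1.0)

# A 3-pixel-thick horizontal band yields an axis through its centre line.
band = np.array([(x, y) for x in range(21) for y in (4, 5, 6)], dtype=float)
axis_ends = fit_wall_axis(band)
```

Fitting on the full pixel region rather than on the skeleton alone averages out skeleton jitter, which is the motivation for combining the two steps.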
Status: Implemented
Enhancements include:
- Vertex snapping and corner closure
- Manhattan-world (orthogonality) enforcement
- Noise-gap healing and wall truncation bridging
Objective: to produce a topologically consistent floorplan suitable for semantic reasoning.
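As a rough illustration of two of these enhancements, the sketch below rotates near-axis-aligned segments onto the nearest axis and merges end points that lie close together. The tolerance `tol_deg`, the snapping `radius`, and both function names are hypothetical; the actual cleanup logic may differ.

```python
import numpy as np


def enforce_manhattan(segment, tol_deg: float = 10.0) -> np.ndarray:
    """Rotate a near-horizontal or near-vertical segment about its midpoint
    onto the axis, preserving its length. Other segments are returned as-is."""
    p, q = np.asarray(segment, dtype=float)
    d = q - p
    angle = np.degrees(np.arctan2(d[1], d[0])) % 180.0
    if min(angle, 180.0 - angle) <= tol_deg:
        axis = np.array([1.0, 0.0])  # near horizontal
    elif abs(angle - 90.0) <= tol_deg:
        axis = np.array([0.0, 1.0])  # near vertical
    else:
        return np.array([p, q])
    mid = (p + q) / 2.0
    half = np.hypot(d[0], d[1]) / 2.0
    sign = 1.0 if d @ axis >= 0 else -1.0  # keep the original end point order
    return np.array([mid - sign * half * axis, mid + sign * half * axis])


def snap_vertices(segments, radius: float = 5.0) -> np.ndarray:
    """Greedily merge end points closer than radius into their mean position."""
    pts = np.asarray(segments, dtype=float).reshape(-1, 2)
    snapped = pts.copy()
    assigned = np.zeros(len(pts), dtype=bool)
    for i in range(len(pts)):
        if assigned[i]:
            continue
        close = (np.linalg.norm(pts - pts[i], axis=1) <= radius) & ~assigned
        snapped[close] = pts[close].mean(axis=0)
        assigned |= close
    return snapped.reshape(-1, 2, 2)
```

The two operations interact: snapping a vertex can tilt a segment that was already axis-aligned, so a practical implementation would likely iterate or solve for corners and orientations jointly.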
Status: Implemented
Uses YOLOv8 object detection to identify architectural features (doors, sliding doors) and carves the corresponding gaps into the vectorized walls using probabilistic geometric slicing heuristics.
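A much-simplified version of the gap-carving idea is shown below: a detected door box is projected onto a wall axis, and the wall is split around the projected interval. The function name `carve_door_gap`, the `max_offset` gate, and the box format `(xmin, ymin, xmax, ymax)` are assumptions for this sketch; the project's own heuristics are more involved.

```python
import numpy as np


def carve_door_gap(wall, door_box, max_offset: float = 10.0):
    """Split a wall segment around the projection of a door bounding box.

    wall: ((x0, y0), (x1, y1)) in pixels.
    door_box: (xmin, ymin, xmax, ymax) in the same pixel frame.
    Returns the wall pieces that remain (the whole wall if the door is not on it).
    """
    p, q = np.asarray(wall, dtype=float)
    d = q - p
    length = np.hypot(d[0], d[1])
    u = d / length                 # unit vector along the wall
    n = np.array([-u[1], u[0]])    # unit normal to the wall
    xmin, ymin, xmax, ymax = door_box
    center = np.array([(xmin + xmax) / 2.0, (ymin + ymax) / 2.0])
    if abs((center - p) @ n) > max_offset:
        return [(p, q)]            # door is too far from this wall
    corners = np.array(
        [[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]], dtype=float
    )
    t = (corners - p) @ u          # positions of the box corners along the wall
    t0, t1 = np.clip([t.min(), t.max()], 0.0, length)
    pieces = []
    if t0 > 0:
        pieces.append((p, p + t0 * u))
    if t1 < length:
        pieces.append((p + t1 * u, q))
    return pieces
```

For walls that are not axis-aligned, projecting an axis-aligned bounding box overestimates the opening width, so some correction would be needed in that case.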
Status: Implemented
Using Trimesh, vector geometry is converted into 3D architectural solids:
- Pixel-to-meter scaling
- Wall extrusion with thickness and height
- Door gap detection
- Header (lintel) generation
- Floor slab construction
- Watertight GLB export
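The scaling and extrusion steps amount to simple vector arithmetic. The sketch below computes the eight corner vertices of one wall solid with NumPy only; it does not use Trimesh, and the default thickness (0.2 m), height (2.7 m), the Z-up convention, and the name `wall_box_vertices` are illustrative assumptions.

```python
import numpy as np


def wall_box_vertices(axis_px, meters_per_pixel: float,
                      thickness_m: float = 0.2, height_m: float = 2.7) -> np.ndarray:
    """Return the 8 corner vertices (x, y, z) of a wall extruded from its axis.

    axis_px: ((x0, y0), (x1, y1)) wall axis in pixels.
    meters_per_pixel: pixel-to-meter scale factor.
    """
    p, q = np.asarray(axis_px, dtype=float) * meters_per_pixel  # metric end points
    d = q - p
    u = d / np.hypot(d[0], d[1])                      # unit vector along the wall
    n = np.array([-u[1], u[0]]) * (thickness_m / 2)   # half-thickness offset
    footprint = np.array([p + n, q + n, q - n, p - n])
    bottom = np.column_stack([footprint, np.zeros(4)])
    top = np.column_stack([footprint, np.full(4, height_m)])
    return np.vstack([bottom, top])


# A 100-pixel wall at 0.05 m/pixel becomes a 5 m long solid.
verts = wall_box_vertices(((0, 0), (100, 0)), meters_per_pixel=0.05)
```

These vertices could then be handed to a mesh library to build faces and export the result. glTF uses a Y-up coordinate system, so an axis swap would be needed before writing a GLB from Z-up data like this.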
Status: Implemented
Replaces the passive model viewer with a First-Person fully interactive spatial experience natively injected into Streamlit:
- Physics & Collision: Bounding Centroid calculations place the Universal Camera's spawn point inside the building geometry.
- Canvas Minimaps: Static HTML canvas rendering the 2D bounding walls with a live telemetry player tracker.
- Raycast Interaction: Click-and-drag mouse look with a 2.5-meter Raycast establishing physical $90^\circ$ dynamic door swings.
Status: Planned
- Editable wall thickness and door dimensions
- IFC / Revit-compatible export
- Stair detection and floor stacking
- Absolute scale calibration
This engine performs heuristic 3D reconstruction using only free-tier compute (1x Kaggle P100/T4, 30 hrs/week).
Compared to contemporary "Floorplan-to-3D" repositories:
- FloorPlanTo3D-unityClient (Custom Mask R-CNN) relies on heavy multi-day training loops on ResNet-101 backbones to extract geometric walls. Instead, we use a fast, morphology-cleaned U-Net topology generator.
- 3DPlanNet & DeepFloorplan (Hybrid Approaches) use similar object detection logic but typically render offline using heavy desktop clients. We perform topological slicing in real time through geometric projection of YOLOv8 Medium detections, keeping peak VRAM utilization at roughly 12 GB.
- Babylon.js & Streamlit vs. Unity/Unreal Engine: Unlike Unity-client implementations, this project runs the entire spatial simulation in sandboxed HTML `iframes`, requiring no external dependencies beyond standard pip modules.
By using Deep Learning (U-Net & YOLO) strictly for prediction and Computational Geometry (Raster-to-Vector topology gaps) for reconstruction, this engine stays within the hardware limits of local consumer graphics cards while maintaining high precision.
A Streamlit-based interface provides:
- Floorplan upload
- Inference mode selection (Fast / High-Fidelity)
- End-to-end Scan-to-BIM execution
- Interactive 3D visualization
- GLB export
The viewer is implemented using Google’s `<model-viewer>` and Base64-embedded GLB rendering.
app.py → Web interface
pipeline.py → Scan-to-BIM processing pipeline
model_links.txt
sample_io/
| Capability | Status |
|---|---|
| CNN-based wall detection | Implemented |
| High-resolution tiled inference | Implemented |
| Mask refinement | Implemented |
| Skeleton-based topology | Implemented |
| Pixel-cloud line fitting | Implemented |
| CAD-grade wall vectors | Implemented |
| Phase 5B (Geometric Cleanup) | Implemented |
| Supervised YOLOv8 Training | Implemented (mAP50=0.80) |
| Phase 2C (Wall-Gap Correction) | Implemented |
| Metric scaling | Implemented |
| BIM-style 3D model | Implemented |
| Web-based visualization | Implemented |
- Thin or ambiguous walls may be missed by the U-Net.
- Intra-class YOLOv8 confusion (single vs. double vs. sliding doors), mitigated by treating all door categories interchangeably for structural logic.
- Furniture Orientation: Relies on snap-alignment to the nearest wall, which can be inaccurate for furniture placed in the center of a room.
- Single-floor support only.
Most systems generate surface meshes. This system reconstructs architectural geometry.
It explicitly models topology, geometry, and scale to generate a true BIM-oriented pipeline.
Version 2.0 — Structural Scan-to-BIM Engine
- Phase 1 (Geometry Execution): U-Net Inference, Trimesh routing, and Phase 5B Junction Optimization are fully implemented.
- Phase 2 (Architectural Detection): A custom YOLOv8 Object Detection model has been trained on Kaggle (mAP50=0.80) on 15k FloorPlanCAD samples.
- Phase 2C (Structural Correlation): Fully implemented. YOLO detections are projected onto the Phase 1 maps, slicing wall intersections to produce door openings and placing dynamically sized furniture.
The core Phase 1 U-Net perception model was evaluated on a held-out dataset of 50 samples:
| Model | IoU | Dice | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| v1 (10ep) | 0.8224 | 0.9005 | 0.8957 | 0.9080 | 0.9005 | 0.9817 |
| v2 (+30ep) | 0.9333 | 0.9653 | 0.9716 | 0.9593 | 0.9653 | 0.9936 |
| v3 (+30ep) | 0.9613 | 0.9801 | 0.9824 | 0.9781 | 0.9801 | 0.9964 |
| v4 (aug+val) | 0.9190 | 0.9576 | 0.9415 | 0.9745 | 0.9576 | 0.9921 |
The active repository currently uses the v3 weights.
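For reference, the metrics in the table can be computed from a pair of binary masks as sketched below. This is a generic formulation, not the project's evaluation script; whether the reported numbers were pooled over all pixels or averaged per sample is not stated, and the function name `segmentation_metrics` is an assumption.

```python
import numpy as np


def segmentation_metrics(pred, target) -> dict:
    """Binary segmentation metrics from a predicted and a ground-truth mask."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))    # wall predicted, wall present
    fp = int(np.sum(pred & ~target))   # wall predicted, no wall present
    fn = int(np.sum(~pred & target))   # wall missed
    tn = int(np.sum(~pred & ~target))  # background correctly rejected

    def ratio(a, b):
        return a / b if b else 0.0     # guard against empty denominators

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }
```

For a single confusion matrix, Dice and F1 are the same quantity, which is consistent with the identical Dice and F1 columns above. Accuracy is dominated by background pixels in floorplans, so IoU is usually the more informative column.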