Preprint | Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
School of Computer Science and Engineering, Xi'an University of Technology
BIM-CLIP is a language-guided multimodal framework for Building Information Modeling (BIM) component classification. It integrates three geometric modalities (point clouds, meshes, and multi-view images), aligns them to language-derived semantic embeddings via contrastive learning, and fuses them adaptively with a language-guided Cross-Modal Attention (CMA) module.
Figure 1: Overview of the BIM-CLIP framework.
- Language-guided semantic alignment: Uses `text-embedding-ada-002` embeddings (1536-dim) as semantic anchors to bridge geometric representations and high-level semantics.
- Cross-Modal Attention (CMA): Language embeddings act as semantic queries to guide inter-modal feature interaction, transforming fusion from feature-driven aggregation into semantic-conditioned selection.
- Alignment-Preserving Fine-Tuning: Only the CMA module is updated during fine-tuning, preserving pretrained alignment with just 10.62M trainable parameters.
- Strong generalization: Competitive zero-shot and cross-dataset transfer performance on IFCNet and ModelNet.
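The language-as-query idea behind CMA can be illustrated with a small, dependency-free sketch. This is not the repository's implementation: the function name `language_guided_fusion`, the single-head scaled dot-product scoring, and the toy 3-dimensional vectors are assumptions made for illustration, whereas the actual module is learned and operates on 1536-dim embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_fusion(text_query, modality_feats):
    """Fuse per-modality features, using a language embedding as the query.

    text_query:     list[float], the language embedding (hypothetical toy size).
    modality_feats: dict mapping a modality name to a feature vector of the
                    same length as text_query.
    Returns (fused_vector, weights_by_modality).
    """
    dim = len(text_query)
    names = list(modality_feats)
    # Score each modality by its scaled dot product with the language query.
    scores = [
        sum(q * k for q, k in zip(text_query, modality_feats[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Weighted sum: modalities that agree with the query contribute more.
    fused = [
        sum(w * modality_feats[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    feats = {"pc": [1.0, 0.0, 0.0], "mesh": [0.0, 1.0, 0.0], "img": [0.0, 0.0, 1.0]}
    fused, weights = language_guided_fusion([2.0, 0.0, 0.0], feats)
    print(weights)
```

In this toy setup the modality whose feature points in the same direction as the query receives the largest weight, which is the "semantic-conditioned selection" behaviour described above in its simplest form.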
Figure 2: BIM-CLIP workflow and downstream applications.
Given heterogeneous inputs, the framework encodes each modality with a dedicated encoder and aligns the resulting features in a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.
| Method | Setting | Acc (%) | Prec (%) | F1 (%) |
|---|---|---|---|---|
| PointCLIP V2 | Zero-shot | 3.83 | 0.06 | 0.76 |
| ULIP-2 | Zero-shot | 1.64 | 1.42 | 0.79 |
| ULIP-2 | Zero-shot* | 39.76 | 30.14 | 29.21 |
| ULIP-2 | Fine-tuned | 87.43 | 87.99 | 86.60 |
| BIM-CLIP (Ours) | Zero-shot | 35.48 | 42.99 | 29.70 |
| BIM-CLIP (Ours) | Fine-tuned | 91.00 | 91.90 | 90.39 |
Zero-shot*: fine-tuned on BIMCompNet-1000, transferred to IFCNet without retraining.
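For readers unfamiliar with the zero-shot setting, the sketch below shows the usual CLIP-style decision rule: a geometric feature is assigned to the class whose text embedding is most similar. That BIM-CLIP uses exactly this rule is an assumption. The function names, the 2-dimensional vectors, and the two class names are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(feature, class_anchors):
    """Return the class name whose anchor is closest to `feature`.

    class_anchors: dict mapping class name -> text embedding (toy values here).
    No classifier head is trained; new classes only need a new text anchor.
    """
    return max(class_anchors, key=lambda name: cosine(feature, class_anchors[name]))


if __name__ == "__main__":
    anchors = {"IfcWall": [1.0, 0.0], "IfcDoor": [0.0, 1.0]}  # toy anchors
    print(zero_shot_predict([0.9, 0.1], anchors))
```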
| Method | BIMCompNet-100 F1 | BIMCompNet-500 F1 | BIMCompNet-1000 F1 |
|---|---|---|---|
| ULIP-2 | 79.72 | 88.56 | 91.02 |
| BIM-CLIP (Ours) | 87.44 | 91.28 | 91.83 |
| Method | ModelNet-10 mAP/Acc | ModelNet-40 mAP/Acc |
|---|---|---|
| BIM-CLIP (Ours) | 95.25 / 95.36 | 90.34 / 92.22 |
pip install transformers torch torchvision plyfile scikit-learn tqdm timm ruamel.yaml
Pretrained and fine-tuned BIM-CLIP model weights are available on HuggingFace:
huggingface.co/flybrid/BIM-CLIP
The extended ModelNet multimodal datasets are in a separate repository:
huggingface.co/datasets/flybrid/BIM-CLIP-ModelNet
For the ULIP-2 baseline, the official pretrained weights (866 MB) are included in the HuggingFace repository at ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt, or download directly:
https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt
| File | Classes | Dataset |
|---|---|---|
| `embeddings.pt` | 57 | BIMCompNet (all categories) |
| `ifcnet_embeddings.pt` | 20 | IFCNet |
| `model_net_10_embeddings.pt` | 10 | ModelNet-10 |
| `model_net_40_embeddings.pt` | 40 | ModelNet-40 |
Note: Embeddings must match the evaluation dataset. Do not mix across datasets.
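A quick sanity check can catch a mismatched embeddings file before a long evaluation run. The sketch below only encodes the class counts from the table above. How the number of embeddings is read from a `.pt` file is left to the caller, because the file's internal layout is not documented here. The helper name `check_embedding_count` is hypothetical.

```python
# Class counts copied from the embeddings table above.
EXPECTED_CLASSES = {
    "embeddings.pt": 57,
    "ifcnet_embeddings.pt": 20,
    "model_net_10_embeddings.pt": 10,
    "model_net_40_embeddings.pt": 40,
}


def check_embedding_count(file_name, num_embeddings):
    """Raise if `num_embeddings` does not match the count listed for `file_name`.

    file_name:      base name of the embeddings file.
    num_embeddings: number of class embeddings the caller found in that file.
    """
    expected = EXPECTED_CLASSES.get(file_name)
    if expected is None:
        raise KeyError(f"unknown embeddings file: {file_name}")
    if num_embeddings != expected:
        raise ValueError(
            f"{file_name} should hold {expected} class embeddings, got {num_embeddings}"
        )
    return True


if __name__ == "__main__":
    print(check_embedding_count("ifcnet_embeddings.pt", 20))
```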
BIMCompNet (single modality, one of IMG / PC / MESH):
python bimclip.py --mode eval --data_type PC \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 --model_path /path/to/best_1000.mdl \
    --embeddings_path embeddings.pt --yaml_path ./描述信息.yaml \
    --output_dir ./results
BIMCompNet (multimodal):
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 --model_path /path/to/best_1000.mdl \
    --embeddings_path embeddings.pt --yaml_path ./描述信息.yaml \
    --output_dir ./results
IFCNet (multimodal):
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
    --model_path /path/to/best_ifcnet.mdl \
    --embeddings_path ifcnet_embeddings.pt --yaml_path ./描述信息.yaml \
    --output_dir ./results
IFCNet expects three modality directories: `IFCNetCorePly`, `IFCNetCorePng`, and `IFCNetCoreObj`. If your directory names follow this convention, only `--ifcnet_root` (pointing to the Ply directory) is needed; the script derives the other two paths automatically.
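The path derivation can be pictured as swapping one path component. The sketch below is a guess at that rule, not the script's actual code: it assumes the three modality directories are siblings and that the sub-path below them is identical. The function name `derive_ifcnet_dirs` is hypothetical.

```python
from pathlib import Path

# Directory names taken from the naming convention described above.
MODALITY_DIRS = {"ply": "IFCNetCorePly", "png": "IFCNetCorePng", "obj": "IFCNetCoreObj"}


def derive_ifcnet_dirs(ifcnet_root):
    """Derive the image and mesh roots from the point-cloud root.

    Assumes (hypothetically) that the path contains an `IFCNetCorePly`
    component and that replacing it yields the sibling modality paths.
    """
    parts = list(Path(ifcnet_root).parts)
    positions = [i for i, part in enumerate(parts) if part == MODALITY_DIRS["ply"]]
    if not positions:
        raise ValueError(f"no {MODALITY_DIRS['ply']} component in {ifcnet_root}")
    derived = {}
    for key, dir_name in MODALITY_DIRS.items():
        swapped = parts.copy()
        swapped[positions[-1]] = dir_name  # replace the last matching component
        derived[key] = Path(*swapped)
    return derived


if __name__ == "__main__":
    for key, path in derive_ifcnet_dirs("/path/to/IFCNetCorePly/IFCNetCore").items():
        print(key, path)
```

If your layout differs from this convention, the safest option is to rename the directories to match rather than rely on any derivation rule.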
ModelNet-10 / ModelNet-40:
python bimclip.py --mode eval --data_type MULTI_MODAL \
    --modelnet_root /path/to/ModelNet_plus --modelnet_version 10 \
    --model_path /path/to/best_10.mdl \
    --embeddings_path model_net_10_embeddings.pt --yaml_path ./描述信息.yaml \
    --output_dir ./results
Fine-tune on IFCNet:
python bimclip.py --mode finetune --data_type MULTI_MODAL \
    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
    --model_path /path/to/best_1000.mdl \
    --embeddings_path ifcnet_embeddings.pt --yaml_path ./描述信息.yaml \
    --epochs 40 --lr 5e-5 --batch_size 16 --output_dir ./results
Fine-tune on BIMCompNet:
python bimclip.py --mode finetune --data_type MULTI_MODAL \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 --model_path /path/to/best_1000.mdl \
    --embeddings_path embeddings.pt --yaml_path ./描述信息.yaml \
    --epochs 10 --lr 3e-5 --batch_size 16 --output_dir ./results
# BIMCompNet (zero-shot)
python baseline_pointclip_v2.py \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 --yaml_path ./描述信息.yaml \
    --batch_size 16 --n_views 10 --output_dir ./results
# IFCNet (zero-shot)
python baseline_pointclip_v2.py \
    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
    --yaml_path ./描述信息.yaml \
    --batch_size 16 --n_views 10 --output_dir ./results
# Zero-shot on IFCNet
python baseline_ulip2.py --mode zeroshot \
    --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
    --ulip2_ckpt /path/to/ULIP-2-pretrained.pt \
    --yaml_path ./描述信息.yaml --batch_size 16 --output_dir ./results
# Fine-tune on BIMCompNet-1000
python baseline_ulip2.py --mode finetune \
    --data_root /path/to/BIMCompNet --index_root /path/to/index \
    --set_size 1000 --ulip2_ckpt /path/to/ULIP-2-pretrained.pt \
    --yaml_path ./描述信息.yaml --batch_size 16 --epochs 10 --lr 1e-4 \
    --output_dir ./results
| Parameter | Description | Default |
|---|---|---|
| `--mode` | `eval` / `finetune` | `eval` |
| `--data_type` | `IMG` / `PC` / `MESH` / `MULTI_MODAL` | `MULTI_MODAL` |
| `--data_root` | BIMCompNet root directory | - |
| `--ifcnet_root` | IFCNet point cloud directory (.ply) | - |
| `--modelnet_root` | ModelNet root directory | - |
| `--modelnet_version` | `10` or `40` | `10` |
| `--set_size` | Samples per class (BIMCompNet) | 1000 |
| `--model_path` | Trained model file (.mdl) | - |
| `--embeddings_path` | Text embeddings file (.pt) | embeddings.pt |
| `--epochs` | Fine-tuning epochs | 150 |
| `--lr` | Learning rate | 5e-5 |
| `--batch_size` | Batch size | 16 |
| `--output_dir` | Results output directory | ./results |
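To make the defaults concrete, the parameter table can be mirrored as an `argparse` parser. This is a reconstruction from the table only, not the parser in `bimclip.py`. Flags that appear in the example commands but not in the table (`--index_root`, `--yaml_path`) are deliberately left out, and the argument types are assumptions.

```python
import argparse


def build_parser():
    """Build a parser whose flags and defaults mirror the parameter table."""
    parser = argparse.ArgumentParser(description="BIM-CLIP options (illustrative sketch)")
    parser.add_argument("--mode", choices=["eval", "finetune"], default="eval")
    parser.add_argument(
        "--data_type", choices=["IMG", "PC", "MESH", "MULTI_MODAL"], default="MULTI_MODAL"
    )
    # Dataset roots have no default; only the one for the target dataset is set.
    parser.add_argument("--data_root", default=None)
    parser.add_argument("--ifcnet_root", default=None)
    parser.add_argument("--modelnet_root", default=None)
    parser.add_argument("--modelnet_version", type=int, choices=[10, 40], default=10)
    parser.add_argument("--set_size", type=int, default=1000)
    parser.add_argument("--model_path", default=None)
    parser.add_argument("--embeddings_path", default="embeddings.pt")
    parser.add_argument("--epochs", type=int, default=150)
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--output_dir", default="./results")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--mode", "finetune", "--epochs", "10", "--lr", "3e-5"])
    print(args.mode, args.epochs, args.lr)
```

Note that the fine-tuning commands above pass `--epochs` explicitly (10 or 40) rather than relying on the default of 150.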
BIM-CLIP is trained in three sequential stages:
- Modality-Specific Pretraining: An independent encoder is pretrained for each modality.
  - Multi-view images → ViT (87.46M params, 30 epochs)
  - Point clouds → PointNet (946K params, 36 epochs)
  - Meshes → MeshNet (3.59M params, 60 epochs)
- Feature Alignment: Each modality is projected into the 1536-dim language embedding space via a learnable linear projection, optimized with a symmetric contrastive loss against `text-embedding-ada-002` anchors.
- Cross-Modal Fusion: The language-guided CMA module (10.62M params, the only module updated during fine-tuning) uses language embeddings as queries to compute cross-modal attention, followed by semantic-aware attention pooling.
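The symmetric contrastive loss used in the alignment stage can be sketched in a few lines of dependency-free Python. This follows the standard CLIP-style InfoNCE formulation, which is assumed here rather than confirmed by the source: row `i` of the geometric features is paired with row `i` of the text anchors, the temperature of 0.07 is a placeholder, and a real implementation would use batched tensor operations in PyTorch.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(geom_feats, text_anchors, temperature=0.07):
    """Average of geometry-to-text and text-to-geometry InfoNCE losses.

    geom_feats, text_anchors: equally sized lists of vectors; row i of one
    is the positive pair of row i of the other (an assumption of this sketch).
    """
    g = [_normalize(v) for v in geom_feats]
    t = [_normalize(v) for v in text_anchors]
    logits = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t] for gi in g]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_transposed)
    )


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

The loss is small when each geometric feature is closest to its own text anchor and large when pairs are mismatched, which is what pulls the three modalities toward a shared language-defined space.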
| Dataset | Domain | Classes | Samples | Modalities |
|---|---|---|---|---|
| BIMCompNet-100 | BIM | 42 | 4,200 | MV + PC + Mesh |
| BIMCompNet-500 | BIM | 31 | 15,500 | MV + PC + Mesh |
| BIMCompNet-1000 | BIM | 24 | 24,000 | MV + PC + Mesh |
| IFCNetCore | BIM | 20 | 7,930 | MV + PC + Mesh |
| ModelNet-10 | Generic | 10 | 4,899 | MV + PC + Mesh* |
| ModelNet-40 | Generic | 40 | 12,311 | MV + PC + Mesh* |
Mesh*: for ModelNet, point clouds and multi-view images are generated from the meshes via our multimodal data construction pipeline.
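The three BIMCompNet rows are consistent with `--set_size` being the number of samples per class: dividing the sample count by the class count gives the subset size. The short check below uses only numbers from the table; the helper name is illustrative.

```python
# (classes, samples) per BIMCompNet subset, copied from the dataset table.
BIMCOMPNET_SUBSETS = {100: (42, 4200), 500: (31, 15500), 1000: (24, 24000)}


def samples_per_class(classes, samples):
    """Average number of samples per class in a subset."""
    return samples / classes


if __name__ == "__main__":
    for set_size, (classes, samples) in BIMCOMPNET_SUBSETS.items():
        print(f"BIMCompNet-{set_size}: {samples_per_class(classes, samples):.0f} per class")
```

Larger subsets cover fewer classes because fewer categories have enough samples to fill the quota.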
BIMCompNet is hosted and maintained by the 606 Lab at Xi'an University of Technology:
https://bimcompnet-606lab.xaut.edu.cn/
Please visit the page above to apply for access or download the dataset. IFCNetCore is publicly available at its original release page. ModelNet can be obtained from the official ModelNet page.
The extended ModelNet-10 and ModelNet-40 datasets (ModelNet10.zip / ModelNet40.zip) are available on HuggingFace. Each object is supplemented with a point cloud and 12 edge-rendered multi-view images:
ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/                # Original mesh (.obj)
    │   ├── ply/                # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/      # 12 edge-rendered views (0.png to 11.png)
    └── test/
        └── (same structure)
The conversion scripts used to generate point clouds and multi-view images are located in the convert/ directory of this repository.
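A loader for this layout might enumerate samples as in the sketch below. The directory structure follows the tree above, but the file names inside `obj/` and `ply/` are not specified in this README. The sketch assumes they are `{sample}.obj` and `{sample}.ply`, matching the `{sample}` directory under `png/`. The function name `list_samples` is hypothetical.

```python
from pathlib import Path

NUM_VIEWS = 12  # edge-rendered views per object, numbered 0.png to 11.png


def list_samples(root, split="train"):
    """Collect the mesh, point-cloud, and view paths of every sample in a split.

    Paths are constructed from the layout, not checked for existence,
    so missing files surface when they are opened.
    """
    samples = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        png_root = class_dir / split / "png"
        if not png_root.is_dir():
            continue  # class without this split
        for sample_dir in sorted(p for p in png_root.iterdir() if p.is_dir()):
            name = sample_dir.name
            samples.append(
                {
                    "label": class_dir.name,
                    # Assumed file names; adjust if the archive differs.
                    "mesh": class_dir / split / "obj" / f"{name}.obj",
                    "points": class_dir / split / "ply" / f"{name}.ply",
                    "views": [sample_dir / "Edges" / f"{i}.png" for i in range(NUM_VIEWS)],
                }
            )
    return samples
```

Calling `list_samples("/path/to/ModelNet10", split="test")` would return one dictionary per object, which a dataset class could then load lazily.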
@article{meng2026bimclip,
title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
journal={[to-be-updated upon acceptance]},
year={2026}
}
This work was supported by the National Natural Science Foundation of China Joint Fund Key Project (U2368203), the Innovation Capability Support Program of Shaanxi (2025RS-CXTD-006), and the Key Research and Development Program of Shaanxi (2024SF2-GJHX-48).

