BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition

Preprint | Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
School of Computer Science and Engineering, Xi'an University of Technology

Overview

BIM-CLIP is a language-guided multimodal framework for Building Information Modeling (BIM) component classification. It integrates three geometric modalities—point clouds, meshes, and multi-view images—aligned to language-derived semantic embeddings via contrastive learning, with a language-guided Cross-Modal Attention (CMA) module for adaptive multimodal fusion.

Figure 1: Overview of the BIM-CLIP framework.

Key Features

Language-guided semantic alignment: Uses text-embedding-ada-002 embeddings (1536-dim) as semantic anchors to bridge geometric representations and high-level semantics.
Cross-Modal Attention (CMA): Language embeddings act as semantic queries to guide inter-modal feature interaction, transforming fusion from feature-driven aggregation into semantic-conditioned selection.
Alignment-Preserving Fine-Tuning: Only the CMA module is updated during fine-tuning, preserving pretrained alignment with just 10.62M trainable parameters.
Strong generalization: Competitive zero-shot and cross-dataset transfer performance on IFCNet and ModelNet.

Figure 2: BIM-CLIP workflow and downstream applications.

Given heterogeneous inputs, the framework encodes each modality with a dedicated encoder and aligns the resulting features within a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.
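As a rough illustration of what contrastive alignment against language anchors means, here is a minimal pure-Python sketch of a symmetric (CLIP-style) contrastive loss. It is not taken from bimclip.py: the function names, the temperature of 0.07, and the use of plain lists instead of tensors are all assumptions made for readability.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(geom, text, temperature=0.07):
    """geom[i] and text[i] form a matched pair; all other pairs are negatives.

    temperature=0.07 is an assumed value, not one stated in this README.
    """
    g = [_normalize(v) for v in geom]
    t = [_normalize(v) for v in text]
    logits = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t] for gi in g]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-geometry direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))
```

The loss is small when each geometric feature is closest to its own text anchor and large when the pairs are mismatched, which is the property the alignment stage relies on.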

Results

IFCNet

Method	Setting	Acc (%)	Prec (%)	F1 (%)
PointCLIP V2	Zero-shot	3.83	0.06	0.76
ULIP-2	Zero-shot	1.64	1.42	0.79
ULIP-2	Zero-shot*	39.76	30.14	29.21
ULIP-2	Fine-tuned	87.43	87.99	86.60
BIM-CLIP (Ours)	Zero-shot	35.48	42.99	29.70
BIM-CLIP (Ours)	Fine-tuned	91.00	91.90	90.39

Zero-shot*: the model was fine-tuned on BIMCompNet-1000 and then evaluated on IFCNet without any further training.
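In CLIP-style models, zero-shot prediction is usually done by picking the class whose text embedding is most similar to the object's feature. The sketch below assumes the zero-shot rows are scored this way; the README does not spell out the procedure, and the class names and 3-dimensional vectors are toy values chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_predict(feature, class_embeddings):
    """Return the class name whose text embedding is closest to the feature.

    class_embeddings maps a class name to its text embedding (hypothetical layout).
    """
    return max(class_embeddings, key=lambda name: cosine(feature, class_embeddings[name]))

# Toy example with made-up classes and vectors.
toy_classes = {"wall": [1.0, 0.0, 0.0], "door": [0.0, 1.0, 0.0]}
prediction = zero_shot_predict([0.9, 0.1, 0.0], toy_classes)
```

No classifier head is trained in this setting, so swapping the embedding file is enough to change the label set.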

BIMCompNet

Method	BIMCompNet-100 F1	BIMCompNet-500 F1	BIMCompNet-1000 F1
ULIP-2	79.72	88.56	91.02
BIM-CLIP (Ours)	87.44	91.28	91.83
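For readers who want to recompute the reported metrics from their own predictions, the following sketch computes accuracy, macro-averaged precision, and macro-averaged F1. Macro averaging is an assumption here; the README does not state which averaging mode the tables use.

```python
from collections import Counter

def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro F1) for two label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, f1s = [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(precisions) / len(labels), sum(f1s) / len(labels)
```

Under macro averaging every class counts equally, so precision and F1 can differ noticeably from accuracy on class-imbalanced data.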

ModelNet

Method	ModelNet-10 mAP/Acc	ModelNet-40 mAP/Acc
BIM-CLIP (Ours)	95.25 / 95.36	90.34 / 92.22

Installation

pip install transformers torch torchvision plyfile scikit-learn tqdm timm ruamel.yaml

Model Weights

Pretrained and fine-tuned BIM-CLIP model weights are available on HuggingFace:
👉 huggingface.co/flybrid/BIM-CLIP

The extended ModelNet multimodal datasets are in a separate repository:
👉 huggingface.co/datasets/flybrid/BIM-CLIP-ModelNet

For the ULIP-2 baseline, the official pretrained weights (866 MB) are included in the HuggingFace repository at ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt. They can also be downloaded directly from:

https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt

Text Embeddings

File	Classes	Dataset
`embeddings.pt`	57	BIMCompNet (all categories)
`ifcnet_embeddings.pt`	20	IFCNet
`model_net_10_embeddings.pt`	10	ModelNet-10
`model_net_40_embeddings.pt`	40	ModelNet-40

Note: Embeddings must match the evaluation dataset. Do not mix across datasets.
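A small guard like the one below can catch a mismatched embeddings file before a long evaluation run. The file names and class counts come from the table above; the dataset keys, the function name, and the idea of passing in the number of loaded class vectors are assumptions, since the internal layout of the .pt files is not documented here.

```python
# File names and class counts copied from the Text Embeddings table.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
EMBEDDING_FILES = {
    "bimcompnet": ("embeddings.pt", 57),
    "ifcnet": ("ifcnet_embeddings.pt", 20),
    "modelnet10": ("model_net_10_embeddings.pt", 10),
    "modelnet40": ("model_net_40_embeddings.pt", 40),
}

def check_embeddings(dataset, embeddings_path, num_class_vectors):
    """Raise ValueError if the embeddings file does not match the dataset."""
    expected_file, expected_classes = EMBEDDING_FILES[dataset]
    file_name = embeddings_path.replace("\\", "/").rsplit("/", 1)[-1]
    if file_name != expected_file:
        raise ValueError(f"{dataset} expects {expected_file}, got {file_name}")
    if num_class_vectors != expected_classes:
        raise ValueError(
            f"{expected_file} should hold {expected_classes} class vectors, "
            f"got {num_class_vectors}"
        )
```

Calling check_embeddings("ifcnet", "./ifcnet_embeddings.pt", 20) passes silently, while pairing IFCNet with embeddings.pt raises an error.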

Usage

BIM-CLIP Evaluation

BIMCompNet (single modality — IMG / PC / MESH):

python bimclip.py --mode eval --data_type PC \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --model_path /path/to/best_1000.mdl \
  --embeddings_path embeddings.pt --yaml_path ./描述信息.yaml \
  --output_dir ./results

BIMCompNet (multimodal):

python bimclip.py --mode eval --data_type MULTI_MODAL \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --model_path /path/to/best_1000.mdl \
  --embeddings_path embeddings.pt --yaml_path ./描述信息.yaml \
  --output_dir ./results

IFCNet (multimodal):

python bimclip.py --mode eval --data_type MULTI_MODAL \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --model_path /path/to/best_ifcnet.mdl \
  --embeddings_path ifcnet_embeddings.pt --yaml_path ./描述信息.yaml \
  --output_dir ./results

IFCNet expects three modality directories: IFCNetCorePly / IFCNetCorePng / IFCNetCoreObj.
If your directory names follow this convention, only --ifcnet_root (pointing to the Ply directory) is needed; the script auto-derives the other two paths.
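The README does not say how the other two paths are derived. One plausible reading, sketched below, is that the IFCNetCorePly path component is swapped for IFCNetCorePng and IFCNetCoreObj; the function name and the returned dictionary keys are hypothetical, and the actual logic in bimclip.py may differ.

```python
from pathlib import PurePosixPath

def derive_ifcnet_roots(ply_root):
    """Guess the PNG and OBJ roots by swapping the 'IFCNetCorePly' path component."""
    parts = PurePosixPath(ply_root).parts
    if "IFCNetCorePly" not in parts:
        raise ValueError("path does not follow the IFCNetCorePly naming convention")

    def swap(name):
        # Replace only the component that names the modality directory.
        return str(PurePosixPath(*[name if p == "IFCNetCorePly" else p for p in parts]))

    return {
        "ply": str(PurePosixPath(*parts)),
        "png": swap("IFCNetCorePng"),
        "obj": swap("IFCNetCoreObj"),
    }

roots = derive_ifcnet_roots("/path/to/IFCNetCorePly/IFCNetCore")
```

If your directories use different names, this kind of derivation cannot work, so check the script's arguments for a way to pass the other paths explicitly.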

ModelNet-10 / ModelNet-40:

python bimclip.py --mode eval --data_type MULTI_MODAL \
  --modelnet_root /path/to/ModelNet_plus --modelnet_version 10 \
  --model_path /path/to/best_10.mdl \
  --embeddings_path model_net_10_embeddings.pt --yaml_path ./描述信息.yaml \
  --output_dir ./results

BIM-CLIP Fine-Tuning

Fine-tune on IFCNet:

python bimclip.py --mode finetune --data_type MULTI_MODAL \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --model_path /path/to/best_1000.mdl \
  --embeddings_path ifcnet_embeddings.pt --yaml_path ./描述信息.yaml \
  --epochs 40 --lr 5e-5 --batch_size 16 --output_dir ./results

Fine-tune on BIMCompNet:

python bimclip.py --mode finetune --data_type MULTI_MODAL \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --model_path /path/to/best_1000.mdl \
  --embeddings_path embeddings.pt --yaml_path ./描述信息.yaml \
  --epochs 10 --lr 3e-5 --batch_size 16 --output_dir ./results

Baseline: PointCLIP V2

# BIMCompNet (zero-shot)
python baseline_pointclip_v2.py \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --yaml_path ./描述信息.yaml \
  --batch_size 16 --n_views 10 --output_dir ./results

# IFCNet (zero-shot)
python baseline_pointclip_v2.py \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --yaml_path ./描述信息.yaml \
  --batch_size 16 --n_views 10 --output_dir ./results

Baseline: ULIP-2

# Zero-shot on IFCNet
python baseline_ulip2.py --mode zeroshot \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --ulip2_ckpt /path/to/ULIP-2-pretrained.pt \
  --yaml_path ./描述信息.yaml --batch_size 16 --output_dir ./results

# Fine-tune on BIMCompNet-1000
python baseline_ulip2.py --mode finetune \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --ulip2_ckpt /path/to/ULIP-2-pretrained.pt \
  --yaml_path ./描述信息.yaml --batch_size 16 --epochs 10 --lr 1e-4 \
  --output_dir ./results

Key Parameters

Parameter	Description	Default
`--mode`	`eval` / `finetune`	`eval`
`--data_type`	`IMG` / `PC` / `MESH` / `MULTI_MODAL`	`MULTI_MODAL`
`--data_root`	BIMCompNet root directory	—
`--ifcnet_root`	IFCNet point cloud directory (.ply)	—
`--modelnet_root`	ModelNet root directory	—
`--modelnet_version`	`10` or `40`	`10`
`--set_size`	Samples per class (BIMCompNet)	`1000`
`--model_path`	Trained model file (.mdl)	—
`--embeddings_path`	Text embeddings file (.pt)	`embeddings.pt`
`--epochs`	Fine-tuning epochs	`150`
`--lr`	Learning rate	`5e-5`
`--batch_size`	Batch size	`16`
`--output_dir`	Results output directory	`./results`
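The table above maps naturally onto an argparse parser. The sketch below mirrors the listed flags, choices, and defaults; it is a reconstruction from the table, not the parser in bimclip.py, and it leaves out flags that appear only in the usage examples (--index_root, --yaml_path).

```python
import argparse

def build_parser():
    """Build a parser that mirrors the Key Parameters table (reconstruction)."""
    p = argparse.ArgumentParser(description="BIM-CLIP evaluation / fine-tuning (sketch)")
    p.add_argument("--mode", choices=["eval", "finetune"], default="eval")
    p.add_argument("--data_type", choices=["IMG", "PC", "MESH", "MULTI_MODAL"],
                   default="MULTI_MODAL")
    p.add_argument("--data_root")      # BIMCompNet root directory
    p.add_argument("--ifcnet_root")    # IFCNet point cloud directory (.ply)
    p.add_argument("--modelnet_root")  # ModelNet root directory
    p.add_argument("--modelnet_version", type=int, choices=[10, 40], default=10)
    p.add_argument("--set_size", type=int, default=1000)  # samples per class
    p.add_argument("--model_path")     # trained model file (.mdl)
    p.add_argument("--embeddings_path", default="embeddings.pt")
    p.add_argument("--epochs", type=int, default=150)
    p.add_argument("--lr", type=float, default=5e-5)
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--output_dir", default="./results")
    return p

defaults = build_parser().parse_args([])
```

Note that the default of 150 epochs is much larger than the 40 and 10 epochs used in the fine-tuning examples, so pass --epochs explicitly when reproducing those runs.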

Architecture

BIM-CLIP consists of three sequential training stages:

1. Modality-Specific Pretraining: Independent encoders for each modality.
   - Multi-view images → ViT (87.46M params, 30 epochs)
   - Point clouds → PointNet (946K params, 36 epochs)
   - Meshes → MeshNet (3.59M params, 60 epochs)
2. Feature Alignment: Each modality is projected into the language embedding space (1536-dim) via learnable linear projections, optimized with symmetric contrastive loss against text-embedding-ada-002 anchors.
3. Cross-Modal Fusion: Language-guided CMA module (10.62M params, only module updated during fine-tuning) uses language embeddings as queries to compute cross-modal attention, followed by semantic-aware attention pooling.
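To make the "language embeddings as queries" idea concrete, here is a deliberately simplified single-head attention sketch in pure Python. It assumes the three modality features have already been projected to the same dimension as the language embedding, and it omits the learned query/key/value projections, multiple heads, and the attention pooling that the real CMA module presumably contains; all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def language_guided_fusion(query, modality_feats):
    """Weight each modality by its scaled dot-product with the language query.

    query: language embedding (list of floats).
    modality_feats: dict mapping a modality name to a feature of the same length.
    Returns (weights per modality, fused feature).
    """
    d = len(query)
    names = list(modality_feats)
    scores = [
        sum(q * k for q, k in zip(query, modality_feats[n])) / math.sqrt(d)
        for n in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * modality_feats[n][i] for w, n in zip(weights, names))
        for i in range(d)
    ]
    return dict(zip(names, weights)), fused

# Toy 2-dimensional example; real embeddings are 1536-dimensional.
toy_weights, toy_fused = language_guided_fusion(
    [1.0, 0.0],
    {"image": [1.0, 0.0], "point_cloud": [0.0, 1.0], "mesh": [0.5, 0.5]},
)
```

The modality whose feature agrees most with the language query receives the largest weight, which is the sense in which fusion becomes semantic-conditioned selection rather than plain aggregation.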

Datasets

Dataset	Domain	Classes	Samples	Modalities
BIMCompNet-100	BIM	42	4,200	MV + PC + Mesh
BIMCompNet-500	BIM	31	15,500	MV + PC + Mesh
BIMCompNet-1000	BIM	24	24,000	MV + PC + Mesh
IFCNetCore	BIM	20	7,930	MV + PC + Mesh
ModelNet-10	Generic	10	4,899	MV + PC + Mesh*
ModelNet-40	Generic	40	12,311	MV + PC + Mesh*

* Point clouds and multi-view images for ModelNet are generated via our multimodal data construction pipeline.
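The BIMCompNet rows in the table are consistent with the --set_size flag meaning "samples per class": the sample count is simply the number of classes times the set size. The snippet below checks this arithmetic using the values from the table; the dictionary and function names are made up for this example.

```python
# Number of classes per BIMCompNet subset, copied from the Datasets table.
BIMCOMPNET_CLASSES = {100: 42, 500: 31, 1000: 24}

def subset_sample_count(set_size):
    """Total samples in a BIMCompNet subset with set_size samples per class."""
    return BIMCOMPNET_CLASSES[set_size] * set_size

totals = {size: subset_sample_count(size) for size in BIMCOMPNET_CLASSES}
```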

Obtaining BIMCompNet

BIMCompNet is hosted and maintained by the 606 Lab at Xi'an University of Technology:

👉 https://bimcompnet-606lab.xaut.edu.cn/

Please visit the page above to apply for access or download the dataset. IFCNetCore is publicly available at its original release page. ModelNet can be obtained from the official ModelNet page.

ModelNet multimodal extension

The extended ModelNet-10 and ModelNet-40 datasets (ModelNet10.zip / ModelNet40.zip) are available on HuggingFace. Each object is supplemented with a point cloud and 12 edge-rendered multi-view images:

ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/        # Original mesh (.obj)
    │   ├── ply/        # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/  # 12 edge-rendered views (0.png – 11.png)
    └── test/
        └── (same structure)
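Given the layout above, the files belonging to one object can be located with a small helper such as the one below. The directory names and the 12 views numbered 0 to 11 come from the tree; naming the mesh and point cloud files {sample}.obj and {sample}.ply is an assumption, as is the example sample name.

```python
from pathlib import PurePosixPath

def sample_paths(root, cls, split, sample):
    """Build the expected mesh, point cloud, and view paths for one object."""
    base = PurePosixPath(root) / cls / split
    return {
        "mesh": str(base / "obj" / f"{sample}.obj"),         # assumed file name
        "point_cloud": str(base / "ply" / f"{sample}.ply"),  # assumed file name
        "views": [
            str(base / "png" / sample / "Edges" / f"{i}.png") for i in range(12)
        ],
    }

# Hypothetical sample name, used only for illustration.
paths = sample_paths("ModelNet10", "chair", "train", "chair_0001")
```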

The conversion scripts used to generate point clouds and multi-view images are located in the covert/ directory of this repository (the directory name is spelled "covert" in the repository file listing).

Citation

@article{meng2026bimclip,
  title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
  author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
  journal={[to-be-updated upon acceptance]},
  year={2026}
}

Acknowledgements

This work was supported by the National Natural Science Foundation of China Joint Fund Key Project (U2368203), the Innovation Capability Support Program of Shaanxi (2025RS-CXTD-006), and the Key Research and Development Program of Shaanxi (2024SF2-GJHX-48).
