FlyChary/BIM-CLIP

BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition

Preprint | Haining Meng, Haoyang Dong, Mingsong Yang, Xing Fan, Xinhong Hei
School of Computer Science and Engineering, Xi'an University of Technology


Overview

BIM-CLIP is a language-guided multimodal framework for Building Information Modeling (BIM) component classification. It integrates three geometric modalities (point clouds, meshes, and multi-view images), aligns them to language-derived semantic embeddings via contrastive learning, and fuses them adaptively with a language-guided Cross-Modal Attention (CMA) module.

Figure 1: Overview of the BIM-CLIP framework.

Key Features

  • Language-guided semantic alignment: Uses text-embedding-ada-002 embeddings (1536-dim) as semantic anchors to bridge geometric representations and high-level semantics.
  • Cross-Modal Attention (CMA): Language embeddings act as semantic queries to guide inter-modal feature interaction, transforming fusion from feature-driven aggregation into semantic-conditioned selection.
  • Alignment-Preserving Fine-Tuning: Only the CMA module is updated during fine-tuning, preserving pretrained alignment with just 10.62M trainable parameters.
  • Strong generalization: Competitive zero-shot and cross-dataset transfer performance on IFCNet and ModelNet.
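
The idea of using a language embedding as the attention query can be illustrated with a deliberately small, dependency-free sketch. It is not the repository's CMA module: the function name `language_guided_fusion`, the single-head scaled dot-product form, and the toy 2-dimensional features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_fusion(text_query, modality_feats):
    """Weight each modality by its scaled dot product with the text query.

    text_query: list of floats (the language embedding, used as the query).
    modality_feats: dict mapping a modality name to a feature list of the
    same length (used as both keys and values).
    Returns (attention weights per modality, fused feature vector).
    """
    dim = len(text_query)
    names = list(modality_feats)
    scores = [
        sum(q * k for q, k in zip(text_query, modality_feats[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * modality_feats[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


# Toy example: the point-cloud feature agrees with the query, so it gets the
# largest weight; the other two modalities tie.
weights, fused = language_guided_fusion(
    [1.0, 0.0],
    {"pc": [1.0, 0.0], "mesh": [0.0, 1.0], "img": [0.0, 0.0]},
)
```

The point of the sketch is the direction of conditioning: the weights depend on the semantic query rather than on the modality features alone, which is what distinguishes semantic-conditioned selection from plain feature aggregation.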

Figure 2: BIM-CLIP workflow and downstream applications.

Given heterogeneous inputs, the framework encodes each modality through a dedicated encoder and aligns the resulting features within a shared semantic embedding space via contrastive learning. After alignment-preserving fine-tuning, the learned representations support two practical BIM downstream tasks.

Results

IFCNet

| Method | Setting | Acc (%) | Prec (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| PointCLIP V2 | Zero-shot | 3.83 | 0.06 | 0.76 |
| ULIP-2 | Zero-shot | 1.64 | 1.42 | 0.79 |
| ULIP-2 | Zero-shot* | 39.76 | 30.14 | 29.21 |
| ULIP-2 | Fine-tuned | 87.43 | 87.99 | 86.60 |
| BIM-CLIP (Ours) | Zero-shot | 35.48 | 42.99 | 29.70 |
| BIM-CLIP (Ours) | Fine-tuned | 91.00 | 91.90 | 90.39 |

Zero-shot*: fine-tuned on BIMCompNet-1000, transferred to IFCNet without retraining.
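
For readers unfamiliar with the zero-shot setting, the sketch below shows the usual CLIP-style decision rule: pick the class whose text embedding has the highest cosine similarity to the encoded sample. Whether `bimclip.py` uses exactly this rule is an assumption, and the function names, class labels, and 2-dimensional vectors are illustrative only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_predict(feature, class_embeddings):
    """Return the class name whose text embedding is most similar to `feature`.

    class_embeddings: dict mapping class name -> embedding list.
    """
    return max(class_embeddings, key=lambda name: cosine(feature, class_embeddings[name]))


# Illustrative class names and toy embeddings, not values from the released files.
prediction = zero_shot_predict(
    [0.9, 0.1],
    {"IfcWall": [1.0, 0.0], "IfcDoor": [0.0, 1.0]},
)
```

No classifier head is trained in this setting, so accuracy depends entirely on how well the geometric features were aligned to the text anchors.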

BIMCompNet

| Method | BIMCompNet-100 F1 | BIMCompNet-500 F1 | BIMCompNet-1000 F1 |
| --- | --- | --- | --- |
| ULIP-2 | 79.72 | 88.56 | 91.02 |
| BIM-CLIP (Ours) | 87.44 | 91.28 | 91.83 |

ModelNet

| Method | ModelNet-10 mAP/Acc | ModelNet-40 mAP/Acc |
| --- | --- | --- |
| BIM-CLIP (Ours) | 95.25 / 95.36 | 90.34 / 92.22 |

Installation

pip install transformers torch torchvision plyfile scikit-learn tqdm timm ruamel.yaml

Model Weights

Pretrained and fine-tuned BIM-CLIP model weights are available on HuggingFace:
👉 huggingface.co/flybrid/BIM-CLIP

The extended ModelNet multimodal datasets are in a separate repository:
👉 huggingface.co/datasets/flybrid/BIM-CLIP-ModelNet

For the ULIP-2 baseline, the official pretrained weights (866 MB) are included in the HuggingFace repository at ULIP2/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt, or download directly:

https://huggingface.co/datasets/SFXX/ulip/resolve/main/ULIP-2/pretrained_models/ULIP-2-PointBERT-8k-xyz-pc-slip_vit_b-objaverse-pretrained.pt

Text Embeddings

| File | Classes | Dataset |
| --- | --- | --- |
| embeddings.pt | 57 | BIMCompNet (all categories) |
| ifcnet_embeddings.pt | 20 | IFCNet |
| model_net_10_embeddings.pt | 10 | ModelNet-10 |
| model_net_40_embeddings.pt | 40 | ModelNet-40 |

Note: Embeddings must match the evaluation dataset. Do not mix across datasets.
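
A small guard can catch a mismatched embeddings file before a long evaluation run. The sketch below only compares file names against the table in this section. The helper `embeddings_match` and the dataset keys are hypothetical and are not part of the repository.

```python
from pathlib import Path

# File names taken from the text-embeddings table in this section.
EMBEDDINGS_FOR_DATASET = {
    "BIMCompNet": "embeddings.pt",
    "IFCNet": "ifcnet_embeddings.pt",
    "ModelNet-10": "model_net_10_embeddings.pt",
    "ModelNet-40": "model_net_40_embeddings.pt",
}


def embeddings_match(dataset, embeddings_path):
    """Return True if the file name is the one listed for `dataset`."""
    if dataset not in EMBEDDINGS_FOR_DATASET:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return Path(embeddings_path).name == EMBEDDINGS_FOR_DATASET[dataset]


if not embeddings_match("IFCNet", "./ifcnet_embeddings.pt"):
    raise SystemExit("embeddings file does not match the evaluation dataset")
```

This checks naming only. A stricter check would load the file and compare its class count with the table, which needs the same loader the scripts use.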


Usage

BIM-CLIP Evaluation

BIMCompNet (single modality: IMG / PC / MESH):

python bimclip.py --mode eval --data_type PC \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --model_path /path/to/best_1000.mdl \
  --embeddings_path embeddings.pt --yaml_path ./描述俑息.yaml \
  --output_dir ./results

BIMCompNet (multimodal):

python bimclip.py --mode eval --data_type MULTI_MODAL \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --model_path /path/to/best_1000.mdl \
  --embeddings_path embeddings.pt --yaml_path ./描述俑息.yaml \
  --output_dir ./results

IFCNet (multimodal):

python bimclip.py --mode eval --data_type MULTI_MODAL \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --model_path /path/to/best_ifcnet.mdl \
  --embeddings_path ifcnet_embeddings.pt --yaml_path ./描述俑息.yaml \
  --output_dir ./results

IFCNet expects three modality directories: IFCNetCorePly / IFCNetCorePng / IFCNetCoreObj.
If your directory names follow this convention, only --ifcnet_root (pointing to the Ply directory) is needed; the script auto-derives the other two paths.
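
The path derivation described above might look like the following sketch. It assumes that the three modality trees are siblings differing only in the `IFCNetCorePly` / `IFCNetCorePng` / `IFCNetCoreObj` path component, as the example commands suggest. The function `derive_ifcnet_roots` is hypothetical, and the script's actual logic may differ.

```python
from pathlib import PurePosixPath


def derive_ifcnet_roots(ply_root):
    """Derive the PNG and OBJ roots from the PLY root by swapping one path component."""
    parts = PurePosixPath(ply_root).parts
    if "IFCNetCorePly" not in parts:
        raise ValueError("expected an 'IFCNetCorePly' component in the path")

    def swap(replacement):
        swapped = [replacement if part == "IFCNetCorePly" else part for part in parts]
        return str(PurePosixPath(*swapped))

    return {
        "ply": str(PurePosixPath(ply_root)),
        "png": swap("IFCNetCorePng"),
        "obj": swap("IFCNetCoreObj"),
    }


roots = derive_ifcnet_roots("/path/to/IFCNetCorePly/IFCNetCore")
# roots["png"] == "/path/to/IFCNetCorePng/IFCNetCore"
# roots["obj"] == "/path/to/IFCNetCoreObj/IFCNetCore"
```

If your layout does not follow this naming, the derivation fails, so check that all three directories exist before launching a run.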

ModelNet-10 / ModelNet-40:

python bimclip.py --mode eval --data_type MULTI_MODAL \
  --modelnet_root /path/to/ModelNet_plus --modelnet_version 10 \
  --model_path /path/to/best_10.mdl \
  --embeddings_path model_net_10_embeddings.pt --yaml_path ./描述俑息.yaml \
  --output_dir ./results

BIM-CLIP Fine-Tuning

Fine-tune on IFCNet:

python bimclip.py --mode finetune --data_type MULTI_MODAL \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --model_path /path/to/best_1000.mdl \
  --embeddings_path ifcnet_embeddings.pt --yaml_path ./描述俑息.yaml \
  --epochs 40 --lr 5e-5 --batch_size 16 --output_dir ./results

Fine-tune on BIMCompNet:

python bimclip.py --mode finetune --data_type MULTI_MODAL \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --model_path /path/to/best_1000.mdl \
  --embeddings_path embeddings.pt --yaml_path ./描述俑息.yaml \
  --epochs 10 --lr 3e-5 --batch_size 16 --output_dir ./results

Baseline: PointCLIP V2

# BIMCompNet (zero-shot)
python baseline_pointclip_v2.py \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --yaml_path ./描述俑息.yaml \
  --batch_size 16 --n_views 10 --output_dir ./results

# IFCNet (zero-shot)
python baseline_pointclip_v2.py \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --yaml_path ./描述俑息.yaml \
  --batch_size 16 --n_views 10 --output_dir ./results

Baseline: ULIP-2

# Zero-shot on IFCNet
python baseline_ulip2.py --mode zeroshot \
  --ifcnet_root /path/to/IFCNetCorePly/IFCNetCore \
  --ulip2_ckpt /path/to/ULIP-2-pretrained.pt \
  --yaml_path ./描述俑息.yaml --batch_size 16 --output_dir ./results

# Fine-tune on BIMCompNet-1000
python baseline_ulip2.py --mode finetune \
  --data_root /path/to/BIMCompNet --index_root /path/to/index \
  --set_size 1000 --ulip2_ckpt /path/to/ULIP-2-pretrained.pt \
  --yaml_path ./描述俑息.yaml --batch_size 16 --epochs 10 --lr 1e-4 \
  --output_dir ./results

Key Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --mode | eval / finetune | eval |
| --data_type | IMG / PC / MESH / MULTI_MODAL | MULTI_MODAL |
| --data_root | BIMCompNet root directory | (none) |
| --ifcnet_root | IFCNet point cloud directory (.ply) | (none) |
| --modelnet_root | ModelNet root directory | (none) |
| --modelnet_version | 10 or 40 | 10 |
| --set_size | Samples per class (BIMCompNet) | 1000 |
| --model_path | Trained model file (.mdl) | (none) |
| --embeddings_path | Text embeddings file (.pt) | embeddings.pt |
| --epochs | Fine-tuning epochs | 150 |
| --lr | Learning rate | 5e-5 |
| --batch_size | Batch size | 16 |
| --output_dir | Results output directory | ./results |
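
If you wrap the scripts in your own tooling, the options above can be mirrored with `argparse`. The parser below is a hypothetical reconstruction from the table only. It is not copied from `bimclip.py`, the argument types are assumptions, and it omits flags that appear in the commands but not in the table (for example `--index_root` and `--yaml_path`).

```python
import argparse


def build_parser():
    """Build a parser whose options and defaults follow the Key Parameters table."""
    parser = argparse.ArgumentParser(description="Hypothetical mirror of the bimclip.py options")
    parser.add_argument("--mode", choices=["eval", "finetune"], default="eval")
    parser.add_argument(
        "--data_type", choices=["IMG", "PC", "MESH", "MULTI_MODAL"], default="MULTI_MODAL"
    )
    parser.add_argument("--data_root", default=None)       # BIMCompNet root directory
    parser.add_argument("--ifcnet_root", default=None)     # IFCNet point cloud directory
    parser.add_argument("--modelnet_root", default=None)   # ModelNet root directory
    parser.add_argument("--modelnet_version", type=int, choices=[10, 40], default=10)
    parser.add_argument("--set_size", type=int, default=1000)
    parser.add_argument("--model_path", default=None)      # trained model file (.mdl)
    parser.add_argument("--embeddings_path", default="embeddings.pt")
    parser.add_argument("--epochs", type=int, default=150)
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--output_dir", default="./results")
    return parser


args = build_parser().parse_args(["--mode", "finetune", "--epochs", "40"])
```

Note that the fine-tuning commands in this README pass `--epochs` explicitly (40 for IFCNet, 10 for BIMCompNet) rather than relying on the default of 150.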

Architecture

BIM-CLIP consists of three sequential training stages:

  1. Modality-Specific Pretraining: Independent encoders for each modality.

    • Multi-view images → ViT (87.46M params, 30 epochs)
    • Point clouds → PointNet (946K params, 36 epochs)
    • Meshes → MeshNet (3.59M params, 60 epochs)
  2. Feature Alignment: Each modality is projected into the language embedding space (1536-dim) via learnable linear projections, optimized with symmetric contrastive loss against text-embedding-ada-002 anchors.

  3. Cross-Modal Fusion: The language-guided CMA module (10.62M params, the only module updated during fine-tuning) uses language embeddings as queries to compute cross-modal attention, followed by semantic-aware attention pooling.
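
The symmetric contrastive objective named in stage 2 can be written out in a few lines. The sketch below is a dependency-free, CLIP-style formulation for illustration. The temperature of 0.07, the function names, and the assumption that row i of each input forms a matching pair are not taken from the repository.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(modal_feats, text_feats, temperature=0.07):
    """Average of modality-to-text and text-to-modality cross-entropy losses.

    modal_feats[i] and text_feats[i] are assumed to be a matching pair.
    """
    modal = [_normalize(v) for v in modal_feats]
    text = [_normalize(v) for v in text_feats]
    logits = [
        [sum(a * b for a, b in zip(m, t)) / temperature for t in text]
        for m in modal
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(feats, feats)        # matching pairs: loss near zero
swapped = symmetric_contrastive_loss(feats, feats[::-1])  # mismatched pairs: large loss
```

When the text anchors are per-class embeddings, two samples of the same class in one batch share an anchor, so a real implementation has to decide how to treat those duplicate positives. The sketch ignores that case.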


Datasets

| Dataset | Domain | Classes | Samples | Modalities |
| --- | --- | --- | --- | --- |
| BIMCompNet-100 | BIM | 42 | 4,200 | MV + PC + Mesh |
| BIMCompNet-500 | BIM | 31 | 15,500 | MV + PC + Mesh |
| BIMCompNet-1000 | BIM | 24 | 24,000 | MV + PC + Mesh |
| IFCNetCore | BIM | 20 | 7,930 | MV + PC + Mesh |
| ModelNet-10 | Generic | 10 | 4,899 | MV + PC + Mesh* |
| ModelNet-40 | Generic | 40 | 12,311 | MV + PC + Mesh* |

*Point clouds and multi-view images for ModelNet are generated via our multimodal data construction pipeline.

Obtaining BIMCompNet

BIMCompNet is hosted and maintained by the 606 Lab at Xi'an University of Technology:

👉 https://bimcompnet-606lab.xaut.edu.cn/

Please visit the page above to apply for access or download the dataset. IFCNetCore is publicly available at its original release page. ModelNet can be obtained from the official ModelNet page.

ModelNet multimodal extension

The extended ModelNet-10 and ModelNet-40 datasets (ModelNet10.zip / ModelNet40.zip) are available on HuggingFace. Each object is supplemented with a point cloud and 12 edge-rendered multi-view images:

ModelNet{10|40}/
└── {class}/
    ├── train/
    │   ├── obj/        # Original mesh (.obj)
    │   ├── ply/        # Point cloud sampled from mesh (1024 pts, .ply)
    │   └── png/
    │       └── {sample}/
    │           └── Edges/  # 12 edge-rendered views (0.png – 11.png)
    └── test/
        └── (same structure)

The conversion scripts used to generate point clouds and multi-view images are located in the convert/ directory of this repository.
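
To pair the three modalities of each object, the layout above can be walked with `pathlib`. The sketch below assumes that a sample's `.obj` file, `.ply` file, and `png/{sample}/` folder share the same base name. That naming rule is an assumption, and `list_samples` is a hypothetical helper rather than part of the repository.

```python
from pathlib import Path


def list_samples(root, split="train"):
    """Collect (label, mesh, point cloud, 12 view paths) records for one split."""
    samples = []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        split_dir = class_dir / split
        obj_dir = split_dir / "obj"
        if not obj_dir.is_dir():
            continue  # class has no data for this split
        for obj_file in sorted(obj_dir.glob("*.obj")):
            stem = obj_file.stem  # assumed to be shared across modalities
            samples.append(
                {
                    "label": class_dir.name,
                    "obj": obj_file,
                    "ply": split_dir / "ply" / f"{stem}.ply",
                    "views": [
                        split_dir / "png" / stem / "Edges" / f"{i}.png"
                        for i in range(12)
                    ],
                }
            )
    return samples
```

The function only builds paths and does not verify that the `.ply` and `.png` files exist, so a loader built on it should check for missing files before training.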


Citation

@article{meng2026bimclip,
  title={BIM-CLIP: Language-Guided Multimodal Representation Learning for BIM Component Recognition},
  author={Meng, Haining and Dong, Haoyang and Yang, Mingsong and Fan, Xing and Hei, Xinhong},
  journal={[to-be-updated upon acceptance]},
  year={2026}
}

Acknowledgements

This work was supported by the National Natural Science Foundation of China Joint Fund Key Project (U2368203), the Innovation Capability Support Program of Shaanxi (2025RS-CXTD-006), and the Key Research and Development Program of Shaanxi (2024SF2-GJHX-48).
