Stream3D: Streaming Zero-Shot 3D Instance Segmentation with Multi-View Noise Mask Filtering and Manifold Refining
Jie Xu, Na Zhao
Singapore University of Technology and Design (SUTD)
We propose Stream3D, a novel streaming zero-shot/open-vocabulary 3D instance segmentation framework. This work builds on excellent prior work, in particular MaskClustering, OpenMask3D, and OVIR-3D. Because we share the same experimental environment as MaskClustering, we adopt its README.md as follows.
Step 1: Install dependencies
First, install PyTorch following the official instructions, e.g., for CUDA 11.8:
conda install pytorch==2.0.0 torchvision==0.15.0 pytorch-cuda=11.8 -c pytorch -c nvidia
Then, install PyTorch3D. You can try 'pip install pytorch3d', but in my experience it does not work, so I install it from source:
cd third_party
git clone git@github.com:facebookresearch/pytorch3d.git
cd pytorch3d && pip install -e .
Finally, install the other dependencies:
cd ../..
pip install -r requirements.txt
Step 2: Download the MaskClustering demo data, then unzip it to ./data. Your directory should look like this: data/demo/scene0608_00, etc.
Step 3: Run the clustering demo and visualize the class-agnostic result using Pyviz3d:
bash demo.sh
In this section, we provide a comprehensive guide to installing the full version of Stream3D, preparing the data, and running experiments on the ScanNet, ScanNet++, and MatterPort3D datasets.
To run the full pipeline of Stream3D, you need to install the 2D instance segmentation tool CropFormer and Open CLIP.
The official installation of CropFormer consists of two steps: installing detectron2 and then CropFormer. For your convenience, I have combined the two steps into the following script. If you run into any problems, please refer to the original CropFormer installation guide.
cd third_party
git clone git@github.com:facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ../
git clone git@github.com:qqlu/Entity.git
cp -r Entity/Entityv2/CropFormer detectron2/projects
cd detectron2/projects/CropFormer/entity_api/PythonAPI
make
cd ../..
cd mask2former/modeling/pixel_decoder/ops
sh make.sh
pip install -U openmim
mim install mmcv
We add an additional script to CropFormer so that it processes all sequences sequentially.
cd ../../../../../../../../
cp mask_predict.py third_party/detectron2/projects/CropFormer/demo_cropformer
Finally, download the CropFormer checkpoint and modify the 'cropformer_path' variable in script.py.
Install the Open CLIP library with:
pip install open_clip_torch
The checkpoint is downloaded automatically the first time you run the script. If you prefer to download it manually, you can download it from here and set its path when loading the CLIP model with the 'create_model_and_transforms' function.
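As a minimal sketch of the manual-checkpoint option, the snippet below picks a local checkpoint file when it exists and otherwise falls back to a pretrained tag that Open CLIP downloads on its own. The model name 'ViT-H-14', the tag 'laion2b_s32b_b79k', and the helper names `resolve_pretrained` and `load_clip` are assumptions made for illustration; check the config in this repository for the model actually used.

```python
from pathlib import Path


def resolve_pretrained(local_checkpoint, default_tag):
    """Return the local checkpoint path if that file exists, else the pretrained tag."""
    if local_checkpoint is not None and Path(local_checkpoint).is_file():
        return str(local_checkpoint)
    return default_tag


def load_clip(model_name="ViT-H-14", local_checkpoint=None,
              default_tag="laion2b_s32b_b79k"):
    """Load a CLIP model; model_name and default_tag are illustrative assumptions."""
    # Imported lazily so resolve_pretrained works without open_clip installed.
    import open_clip

    pretrained = resolve_pretrained(local_checkpoint, default_tag)
    # `pretrained` may be a tag (downloaded automatically) or a local file path.
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    return model.eval(), preprocess
```

Calling `load_clip(local_checkpoint="/path/to/checkpoint.bin")` would use the local file if it is present and silently fall back to the automatic download otherwise.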
Please follow the official ScanNet guide to sign the agreement and send it to scannet@googlegroups.com. After receiving the response, you can download the data. You only need to download the files with the following suffixes: ['.aggregation.json', '.sens', '.txt', '_vh_clean_2.0.010000.segs.json', '_vh_clean_2.ply', '_vh_clean_2.labels.ply']. Please also turn on the 'label_map' option to download the 'scannetv2-labels.combined.tsv' file.
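Before preprocessing, it can help to confirm that every required file was downloaded. The sketch below assumes the raw data is laid out as `raw_data_dir/<scene_id>/<scene_id><suffix>`; that layout and the function name `missing_scannet_files` are assumptions, so adjust them to match your download.

```python
from pathlib import Path

# Suffixes from the list above; each is appended directly to the scene id.
REQUIRED_SUFFIXES = [
    ".aggregation.json",
    ".sens",
    ".txt",
    "_vh_clean_2.0.010000.segs.json",
    "_vh_clean_2.ply",
    "_vh_clean_2.labels.ply",
]


def missing_scannet_files(raw_data_dir, scene_id):
    """Return the required file names absent from raw_data_dir/scene_id."""
    scene_dir = Path(raw_data_dir) / scene_id
    return [
        scene_id + suffix
        for suffix in REQUIRED_SUFFIXES
        if not (scene_dir / (scene_id + suffix)).is_file()
    ]
```

An empty list means the scene has all six files; any names returned are the ones still to download.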
After downloading the data, you can run the following script to prepare the data. Please change the 'raw_data_dir', 'target_data_dir', 'split_file_path', 'label_map_file' and 'gt_dir' variables before you run.
cd preprocess/scannet
python process_val.py
python prepare_gt.py
After running the scripts, you will get the following directory structure:
data/scannet
├── processed
│   └── scene0011_00
│       ├── pose                            <- folder with camera poses
│       │   ├── 0.txt
│       │   ├── 10.txt
│       │   └── ...
│       ├── color                           <- folder with RGB images
│       │   ├── 0.jpg (or .png/.jpeg)
│       │   ├── 10.jpg (or .png/.jpeg)
│       │   └── ...
│       ├── depth                           <- folder with depth images
│       │   ├── 0.png (or .jpg/.jpeg)
│       │   ├── 10.png (or .jpg/.jpeg)
│       │   └── ...
│       ├── intrinsic
│       │   ├── intrinsic_depth.txt         <- camera intrinsics
│       │   └── ...
│       └── scene0011_00_vh_clean_2.ply     <- point cloud of the scene
└── gt                                      <- folder with ground truth 3D instance masks
    ├── scene0011_00.txt
    └── ...
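The layout above can be sanity-checked with a short script. This is a sketch only: it assumes that each frame id (for example 0 or 10) appears once in each of pose, color, and depth, which the tree suggests but the preprocessing scripts may not guarantee. The function names are hypothetical.

```python
from pathlib import Path


def frame_ids(folder):
    """Collect file stems (frame ids such as '0', '10') from one folder."""
    return {p.stem for p in Path(folder).iterdir() if p.is_file()}


def check_processed_scene(scene_dir):
    """Return a list of problems; an empty list means the layout looks complete."""
    scene_dir = Path(scene_dir)
    problems = [
        f"missing folder: {sub}"
        for sub in ("pose", "color", "depth", "intrinsic")
        if not (scene_dir / sub).is_dir()
    ]
    if problems:
        return problems
    if not (scene_dir / "intrinsic" / "intrinsic_depth.txt").is_file():
        problems.append("missing intrinsic/intrinsic_depth.txt")
    if not (scene_dir / f"{scene_dir.name}_vh_clean_2.ply").is_file():
        problems.append("missing point cloud")
    # Assumption: every frame id should exist in pose, color, and depth alike.
    pose, color, depth = (frame_ids(scene_dir / s) for s in ("pose", "color", "depth"))
    unmatched = (pose | color | depth) - (pose & color & depth)
    if unmatched:
        problems.append(f"{len(unmatched)} frame ids are not in all of pose/color/depth")
    return problems
```

For example, `check_processed_scene("data/scannet/processed/scene0011_00")` returns an empty list when the scene matches the tree shown above.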
Please follow the official ScanNet++ guide to sign the agreement and download the data. To help reproduce the results, we provide the configs we use to download and preprocess ScanNet++ in preprocess/scannetpp. Please modify the paths in these configs and copy them to the corresponding folders before running the scripts. Then clone the ScanNet++ toolkit.
To extract the RGB and depth images, run the following commands:
python -m iphone.prepare_iphone_data iphone/configs/prepare_iphone_data.yml
python -m common.render common/configs/render.yml
Since the original mesh has a very high resolution, we downsample it and generate the ground truth accordingly:
python -m semantic.prep.prepare_training_data semantic/configs/prepare_training_data.yml
python -m semantic.prep.prepare_semantic_gt semantic/configs/prepare_semantic_gt.yml
After running the scripts, you will get the following directory structure:
data/scannetpp
├── data
│   └── 0d2ee665be
│       ├── iphone
│       │   ├── rgb
│       │   │   ├── frame_000000.jpg
│       │   │   ├── frame_000001.jpg
│       │   │   └── ...
│       │   ├── render_depth
│       │   │   ├── frame_000000.png
│       │   │   ├── frame_000001.png
│       │   │   └── ...
│       │   └── ...
│       └── scans
│           └── ...
├── gt
├── metadata
├── pcld_0.25      <- downsampled point cloud of the scene
└── splits
Please follow the official MatterPort3D guide to sign the agreement and download the data. We use a subset of its testing scenes to ensure Mask3D remains within memory constraints. The list of scenes we use can be found in splits/matterport3d.txt. Download only the following: ['undistorted_color_images', 'undistorted_depth_images', 'undistorted_camera_parameters', 'house_segmentations']. Upon download, unzip the files. Your directory structure should resemble (or you can modify the paths in 'preprocess/matterport3d/process.py' and 'dataset/matterport.py'):
data/matterport3d/scans
├── 2t7WUuJeko7
│   └── 2t7WUuJeko7
│       ├── house_segmentations
│       │   ├── 2t7WUuJeko7.ply
│       │   └── ...
│       ├── undistorted_camera_parameters
│       │   └── 2t7WUuJeko7.conf
│       ├── undistorted_color_images
│       │   ├── xxx_i0_0.jpg
│       │   └── ...
│       └── undistorted_depth_images
│           ├── xxx_d0_0.png
│           └── ...
├── ARNzJeq3xxb
├── ...
└── YVUC4YcDtcY
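To confirm that every scene in the split has the four required folders, a check along these lines may be useful. It is a sketch that assumes splits/matterport3d.txt lists one scene id per line and that scenes are nested as `<scene>/<scene>/` as in the tree above; the function names are hypothetical.

```python
from pathlib import Path

# The four downloads named above.
REQUIRED_FOLDERS = [
    "house_segmentations",
    "undistorted_camera_parameters",
    "undistorted_color_images",
    "undistorted_depth_images",
]


def read_scene_list(split_file):
    """Read scene ids, assuming one id per non-empty line."""
    lines = Path(split_file).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def incomplete_scenes(scans_root, scene_ids):
    """Map each incomplete scene id to the list of folders it is missing."""
    report = {}
    for scene in scene_ids:
        base = Path(scans_root) / scene / scene
        missing = [name for name in REQUIRED_FOLDERS if not (base / name).is_dir()]
        if missing:
            report[scene] = missing
    return report
```

A typical call would be `incomplete_scenes("data/matterport3d/scans", read_scene_list("splits/matterport3d.txt"))`; an empty dictionary means all listed scenes are complete.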
Then run the following script to prepare the ground truth:
cd preprocess/matterport3d
python process.py
To run an experiment, find the corresponding config in the 'configs' folder and run the following command. Remember to change the 'cropformer_path' variable in the config and the 'CUDA_LIST' variable in run.py.
python run.py --config config_name
For example, to run the ScanNet experiment, use the following command:
python run.py --config scannet
run.py obtains the 2D instance masks, runs mask clustering, extracts open-vocabulary features, and evaluates the results. The evaluation results are saved in the 'data/evaluation' folder.
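When choosing 'CUDA_LIST', it may help to picture how sequences could be spread over several GPUs. The sketch below shows a simple round-robin assignment. It is an illustration under assumptions and not a description of what run.py does internally; the function name `assign_sequences` is hypothetical.

```python
def assign_sequences(seq_names, cuda_list):
    """Round-robin split: sequence i goes to cuda_list[i % len(cuda_list)]."""
    if not cuda_list:
        raise ValueError("cuda_list must contain at least one GPU id")
    buckets = {gpu: [] for gpu in cuda_list}
    for i, name in enumerate(seq_names):
        buckets[cuda_list[i % len(cuda_list)]].append(name)
    return buckets


if __name__ == "__main__":
    # Hypothetical scene names and GPU ids, for illustration only.
    plan = assign_sequences(["scene0011_00", "scene0608_00", "scene0015_00"], [0, 1])
    for gpu, seqs in plan.items():
        print(f"GPU {gpu}: {seqs}")
```

Round-robin keeps the number of sequences per GPU within one of each other, which is a reasonable default when scenes have similar sizes.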
To visualize the 3D class-agnostic result of one specific scene, run the following command:
python -m visualize.vis_scene --config scannet --seq_name scene0608_00