Ges3ViG

Built with PyTorch Lightning and WandB.

This is the official implementation for Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding.

Model Architecture

Requirements

This repo contains a CUDA implementation; please make sure your GPU compute capability is 3.0 or above.
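
If you are unsure what your GPU reports, a quick way to check is through PyTorch itself. This is a minimal sketch; it only assumes a CUDA-enabled PyTorch build is already installed:

import torch

# Query the compute capability of the first visible GPU; the CUDA
# extensions in this repo require compute capability >= 3.0.
if not torch.cuda.is_available():
    raise RuntimeError("PyTorch does not see a CUDA-capable GPU.")

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
assert (major, minor) >= (3, 0), "Compute capability below 3.0 is not supported."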

Setup

Conda (recommended)

We recommend the use of miniconda to manage system dependencies.

# create and activate the conda environment
conda create -n ges3vig python=3.10
conda activate ges3vig

# install PyTorch 2.0.1
conda install pytorch torchvision pytorch-cuda=11.7 -c pytorch -c nvidia

# install PyTorch3D with dependencies
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install pytorch3d -c pytorch3d

# install MinkowskiEngine with dependencies
conda install -c anaconda openblas
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps \
--install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"

# install Python libraries
pip install .

# install CUDA extensions
cd ges3vig/common_ops
pip install .

Pip

Note: Setting up with pip (no conda) requires OpenBLAS to be pre-installed in your system.
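
As a quick, hedged check that an OpenBLAS shared library is discoverable on your system (library naming and installation vary by platform, so treat a negative result only as a hint):

from ctypes.util import find_library

# Look for an OpenBLAS shared library on the default library search path.
# A None result suggests OpenBLAS may be missing (or installed under a
# non-standard name), in which case MinkowskiEngine will fail to build.
lib = find_library("openblas")
print(f"OpenBLAS found: {lib}" if lib else "OpenBLAS not found on the default library path.")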

# create and activate the virtual environment
virtualenv env
source env/bin/activate

# install PyTorch 2.0.1
pip install torch torchvision

# install PyTorch3D
pip install pytorch3d

# install MinkowskiEngine
pip install MinkowskiEngine

# install Python libraries
pip install .

# install CUDA extensions
cd ges3vig/common_ops
pip install .

Data Preparation

Note: The ImputeRefer dataset requires the ScanNet v2 dataset; please preprocess it first.

ScanNet v2 dataset

  1. Download the ScanNet v2 dataset (train/val/test). The raw dataset files should be organized as follows:

    ges3vig # project root
    ├── dataset
    │   ├── scannetv2
    │   │   ├── scans
    │   │   │   ├── [scene_id]
    │   │   │   │   ├── [scene_id]_vh_clean_2.ply
    │   │   │   │   ├── [scene_id]_vh_clean_2.0.010000.segs.json
    │   │   │   │   ├── [scene_id].aggregation.json
    │   │   │   │   ├── [scene_id].txt
  2. Pre-process the data; this converts the original meshes and annotations to .pth files:

    python dataset/scannetv2/preprocess_all_data.py data=scannetv2 +workers={cpu_count}
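
To spot-check the preprocessing output, you can load one of the generated .pth files and print its structure. This is a minimal sketch: the output location and the exact keys are determined by preprocess_all_data.py, so adjust the glob pattern to your setup.

import glob

import torch

# Find any .pth file produced by the preprocessing step (the path is an assumption).
files = sorted(glob.glob("dataset/scannetv2/**/*.pth", recursive=True))
assert files, "No .pth files found - has preprocessing finished?"

data = torch.load(files[0], map_location="cpu")
print(f"Loaded {files[0]} ({type(data).__name__})")
if isinstance(data, dict):
    for key, value in data.items():
        desc = type(value).__name__
        if hasattr(value, "shape"):
            desc += f" {tuple(value.shape)}"
        print(f"  {key}: {desc}")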

Augmenting the ScanNet v2 dataset with human models

  1. Download the Human Pose Data and the Human Models.
  2. Unzip the zip files and arrange the data according to the following folder structure:
    ges3vig # project root
    ├── dataset
    │   ├── humans
    │   │   ├── man_tall_white
    │   │   ├── man_medium_white
    │   │   ├── woman_medium_white
    │   │   ├── woman_tall_white
    │   │   ...
    │   ├── scannetv2
    │   │   ├── scans
    │   │   ├── human_info_train
    │   │   ├── human_info_test
  3. Augment the data by running the following command in the project root:

    python dataset/scannetv2/combine_data.py +split={split}

    This should populate the {split}_imputed folders with the relevant .pth files.
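
If you prefer to run this step for every split in one go, a small wrapper like the following works. It is a sketch: the split names are an assumption based on the human_info_train / human_info_test folders above.

import subprocess

# Run combine_data.py once per split from the project root.
# The split names are an assumption based on the folder layout above.
for split in ("train", "test"):
    subprocess.run(
        ["python", "dataset/scannetv2/combine_data.py", f"+split={split}"],
        check=True,
    )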

Generating Human Position data using Imputer

  1. Install Imputer by running the following in the project root:

    cd Imputer
    pip install -e .
    cd ..

    This compiles utilities that are essential for Imputer to run on the GPU.

  2. Edit config/data/human.yaml and set the imputer_target_dir field. This is the target folder where the imputer will save the imputed human positions.

  3. Run the imputer using the following command:

    python3 dataset/scannetv2/imputer_full.py data=imputerefer

  4. Run the following command to populate the {split}_imputed folders with the final processed dataset:

    python3 dataset/scannetv2/preprocess_all_imputed_data.py data=imputerefer
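
A quick way to confirm that this last step populated the imputed splits is to count the generated files. This is a sketch: both the parent directory of the {split}_imputed folders and the split names are assumptions, so adjust the paths to your layout.

from pathlib import Path

# Count the .pth files written into each {split}_imputed folder.
# The parent directory and the split names are assumptions.
for split in ("train", "test"):
    folder = Path("dataset/scannetv2") / f"{split}_imputed"
    files = list(folder.glob("*.pth")) if folder.is_dir() else []
    print(f"{folder}: {len(files)} .pth files")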

ImputeRefer Ground Truth and Language Descriptions

  1. Download the ImputeRefer ground truth and language descriptions (train/test). The raw dataset files should be organized as follows:

    ges3vig # project root
    ├── dataset
    │   ├── imputerefer
    │   │   ├── metadata
    │   │   │   ├── ImputeRefer_filtered_train.json
    │   │   │   ├── ImputeRefer_filtered_test.json
  2. Pre-process the data; "unique/multiple" labels will be added to the raw .json files for evaluation purposes:

    python dataset/scanrefer/add_evaluation_labels.py data=imputerefer
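
For context, the "unique/multiple" split follows the ScanRefer-style evaluation convention: a sample counts as "unique" when its target is the only object of that semantic class in the scene, and "multiple" otherwise. The sketch below illustrates that criterion on the raw annotations; the field names (scene_id, object_id, object_name) are assumed to follow the ScanRefer-style schema, and the actual add_evaluation_labels.py script may compute the labels differently (e.g., against the full scene ground truth).

import json

# Illustrative only: approximate the unique/multiple criterion from the
# annotations themselves. Field names are assumptions.
with open("dataset/imputerefer/metadata/ImputeRefer_filtered_train.json") as f:
    samples = json.load(f)

# Collect the distinct annotated objects of each class per scene.
objects_per_class = {}
for s in samples:
    objects_per_class.setdefault((s["scene_id"], s["object_name"]), set()).add(s["object_id"])

# A sample is "unique" if its class occurs exactly once in the scene.
for s in samples:
    count = len(objects_per_class[(s["scene_id"], s["object_name"])])
    s["eval_type"] = "unique" if count == 1 else "multiple"

print(sum(s["eval_type"] == "unique" for s in samples), "unique samples")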

Pre-trained detector

We pre-trained PointGroup (as implemented in MINSU3D) on imputed ScanNet v2 scenes and use it as the detector. The inputs are coordinates + colors + multi-view features.

  1. Download the pre-trained detector. The checkpoint file should be placed as follows:
    ges3vig # project root
    ├── checkpoints
    │   ├── PointGroup_ScanNet.ckpt
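
To confirm the checkpoint downloaded intact, you can load it on the CPU and look at its top-level structure. This is a minimal sketch; PyTorch Lightning checkpoints typically contain entries such as state_dict and hyper_parameters, but the exact contents depend on how the detector was exported.

import torch

# Load the detector checkpoint on CPU purely to inspect its structure.
ckpt = torch.load("checkpoints/PointGroup_ScanNet.ckpt", map_location="cpu")
print("Top-level keys:", list(ckpt.keys()))

# Lightning checkpoints usually store the weights under "state_dict".
state_dict = ckpt.get("state_dict", {})
print(f"{len(state_dict)} parameter tensors in state_dict")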

Training, Inference and Evaluation

Note: Configuration files are managed by Hydra; you can add or override any configuration attribute by passing it as a command-line argument (a small composition sketch follows the commands below).

wandb login

# train a model without the pre-trained detector, using predicted object proposals on imputed data
python train.py data=imputerefer experiment_name={any_string}


# train a model with the pre-trained detector, using predicted object proposals
python train.py data=imputerefer experiment_name={any_string} +detector_path=checkpoints/PointGroup_ScanNet.ckpt


# train a model from a checkpoint on imputed data; this restores all hyperparameters stored in the .ckpt file (reusing the original experiment_name is suggested)
python train.py data=imputerefer experiment_name={checkpoint_experiment_name} ckpt_path={ckpt_file_path}

# test a model from a checkpoint and save its predictions for imputed data
python test.py data=imputerefer data.inference.split={train/val/test} ckpt_path={ckpt_file_path} pred_path={predictions_path}
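
The sketch below shows how the same Hydra overrides used on the command line can be composed programmatically, which is handy for checking what a set of arguments resolves to before launching a run. The config_path and config_name values are assumptions about this repo's layout (the README only shows config/data/human.yaml), so adjust them to match.

from hydra import compose, initialize

# Compose the configuration the same way the CLI entry points do.
# config_path and config_name are assumptions about this repo's layout.
with initialize(version_base=None, config_path="config"):
    cfg = compose(config_name="config",
                  overrides=["data=imputerefer", "experiment_name=debug"])
    print(cfg.data)             # resolved "data" config group
    print(cfg.experiment_name)  # plain attribute override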

Checkpoints

ImputeRefer dataset

Ges3ViG_ImputeRefer.ckpt

Performance:

| Model | IoU @0.25 (unique) | IoU @0.5 (unique) | IoU @0.25 (multiple) | IoU @0.5 (multiple) | IoU @0.25 (overall) | IoU @0.5 (overall) |
|---|---|---|---|---|---|---|
| Without Gestures: | | | | | | |
| 3DVG-Transformer | 71.56 | 50.66 | 31.35 | 21.54 | 39.17 | 27.20 |
| HAM | 67.10 | 48.13 | 25.42 | 16.04 | 33.51 | 22.27 |
| 3DJCG | 75.93 | 59.19 | 40.34 | 30.61 | 47.24 | 36.16 |
| M3DRefCLIP | 77.32 | 60.15 | 62.62 | 47.27 | 65.53 | 49.78 |
| With Gestures: | | | | | | |
| ScanERU | 71.60 | 52.79 | 31.91 | 23.06 | 39.54 | 28.84 |
| Ges3ViG | 84.60 | 71.03 | 67.57 | 55.77 | 70.85 | 58.71 |

Acknowledgements

We would like to acknowledge M3DRefCLIP for their 3D visual grounding codebase.
