- We use GPT-4V to recognize objects in the image as the input prompt for Grounded-Segment-Anything. Alternatively, you could use any other object recognition model (e.g., RAM) to get the objects in the given image.
- We put the pre-computed GPT-4V result under each `data/${name}/obj_movable.json`. You can skip the steps below and run segmentation directly if you don't want to re-run GPT-4V.
- Copy the OpenAI API key into `gpt/gpt_configs/my_apikey`.
- Install requirements:

  ```bash
  pip install inflect openai==0.28
  ```
- Run GPT-4V based RAM:

  ```bash
  python gpt_ram.py --img_path ../data/${name}
  ```
- The default `save_path` is under the same folder as the input: `../data/${name}/intermediate/obj_movable.json`. The output is a list in JSON format:

  ```
  [
      {"obj_1": True},  # True if the object is movable, False if not
      {"obj_2": True},
      ...
  ]
  ```
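A minimal sketch of how you might read this file and turn it into prompts for the next segmentation step (the scene name `pool` and the prompt-building logic are illustrative, not part of the repo):

```python
import json

name = "pool"  # placeholder scene name; use your own data/${name} folder
with open(f"../data/{name}/intermediate/obj_movable.json") as f:
    objects = json.load(f)  # list of {object_name: is_movable}

all_objects = [obj for entry in objects for obj in entry]
movable = [obj for entry in objects for obj, flag in entry.items() if flag]
print("prompts for Grounded-SAM:", all_objects)
print("movable objects:", movable)
```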
We use Grounded-Segment-Anything to segment the input image given the input prompts. We use an earlier checked-out version for the paper. You could adapt the code to the latest version of Grounded-SAM or Grounded-SAM-2.
- Follow the Grounded-SAM setup:

  ```bash
  cd Grounded-Segment-Anything/
  git checkout 753dd6675ea7935af401a983e88d159629ad4d5b
  # Follow the Grounded-SAM readme to install requirements
  # Download pretrained weights to the current folder
  wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
  wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
  ```
- Segmentation requires the input image `input` and a prompt file `prompts_path` listing the objects in the image. The default prompt path is `../data/${name}/intermediate/obj_movable.json`.

  ```bash
  python run_gsam.py --input ../data/${name}
  ```
- The default `output` is saved under the same folder as the input `../data/${name}`, with visualizations under `../data/${name}/intermediate`, as follows:

  ```
  image folder/
  ├── intermediate/
      ├── mask.png      # FG and BG segmentation mask
      ├── mask.json     # segmentation id, object name, and movability
      ├── vis_mask.jpg  # segmentation visualization
  ```
*(Example segmentations: Pool, Domino, Pig Ball, Balls)*
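A minimal sketch of how the segmentation output could be inspected, assuming `mask.png` stores a per-pixel segmentation id and `mask.json` maps ids to object names and movability (the scene name `pool` is a placeholder; check the actual files for the exact schema):

```python
import json
import numpy as np
from PIL import Image

name = "pool"  # placeholder scene name
mask = np.array(Image.open(f"../data/{name}/intermediate/mask.png"))
with open(f"../data/{name}/intermediate/mask.json") as f:
    seg_info = json.load(f)

print("segment ids in mask:", np.unique(mask))
print("segment metadata:", seg_info)
```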
- We use GeoWizard to estimate the depth and normal of the input image. Follow the GeoWizard setup to install requirements. We recommend creating a new conda environment.
- Run GeoWizard on the input image:

  ```bash
  python run_depth_normal.py --input ../data/${name} --output ../outputs/${name} --vis
  ```
- Depth and normal are saved in `outputs/${name}`. Visualizations are saved in `outputs/${name}/intermediate`.

  ```
  image folder/
  ├── depth.npy
  ├── normal.npy
  ├── intermediate/
      ├── depth_vis.png
      ├── normal_vis.png
  ```
*(Visualization: Input | Normal | Depth)*
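A minimal sketch for sanity-checking the saved geometry, assuming `depth.npy` is an H×W depth map and `normal.npy` an H×W×3 normal map (the scene name `pool` is a placeholder):

```python
import numpy as np

name = "pool"  # placeholder scene name
depth = np.load(f"../outputs/{name}/depth.npy")
normal = np.load(f"../outputs/{name}/normal.npy")

print("depth:", depth.shape, "range:", depth.min(), depth.max())
print("normal:", normal.shape)  # each pixel is expected to hold a (roughly) unit-length vector
```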
- We separate the foreground and background using the segmentation mask. Foreground objects with complete masks are used for physics reasoning and simulation, while truncated objects are treated as static. We use edges from static objects and the background as physical boundaries for simulation.
- The module requires the following input for each image:

  ```
  image folder/
  ├── depth.npy
  ├── normal.npy
  ├── original.png      # optional: for visualization only
  ├── intermediate/
      ├── mask.png      # complete image segmentation mask
      ├── mask.json     # segmentation id, object name, and movability
  ```
- Run foreground/background separation and edge detection:

  ```bash
  python run_fg_bg.py --input ../data/${name} --vis_edge
  ```
- The default output is saved under the same folder as the input `../data/${name}`; it contains the final foreground object mask `mask.png` and the edge list `edges.json` saved in `outputs/${name}`.

  ```
  image folder/
  ├── mask.png          # final mask
  ├── edges.json
  ├── intermediate/
      ├── edge_vis.png     # red lines mark the edges
      ├── fg_mask_vis.png  # text is the segmentation id
      ├── bg_mask_vis.png
      ├── bg_mask.png
  ```
*(Visualization: Input | Foreground | Background | Edges)*

- We can simulate all foreground objects by specifying their velocity and acceleration using their segmentation ids in the simulation.
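A minimal sketch for listing the foreground segmentation ids that the simulation can reference, assuming background pixels are 0 in the final `mask.png` and without assuming any particular `edges.json` schema (the scene name `pool` is a placeholder):

```python
import json
import numpy as np
from PIL import Image

name = "pool"  # placeholder scene name
mask = np.array(Image.open(f"../data/{name}/mask.png"))
fg_ids = [int(i) for i in np.unique(mask) if i != 0]  # assumes 0 marks the background
print("foreground segment ids:", fg_ids)

with open(f"../data/{name}/edges.json") as f:
    edges = json.load(f)
print("number of edge entries:", len(edges))
```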
We use Inpaint-Anything to inpaint the background of the input image. You could adapt the code to any other, more recent inpainting model.
- Follow the Inpaint-Anything setup to install requirements and download the pretrained model. We recommend creating a new conda environment.

  ```bash
  python -m pip install torch torchvision torchaudio
  python -m pip install -e segment_anything
  python -m pip install -r lama/requirements.txt
  # Download pretrained model under Inpaint-Anything/pretrained_models/
  ```
- Inpainting requires the input image `../data/${name}/original.png` and the foreground mask `../data/${name}/mask.png` under the same folder.

  ```bash
  python run_inpaint.py --input ../data/${name} --output ../outputs/${name} --dilate_kernel_size 20
  ```

  `dilate_kernel_size` can be adjusted in the above script. For images with heavy shadows, increase `dilate_kernel_size` to get better inpainting results (see the sketch below).
- The output `inpaint.png` is saved in `outputs/${name}`.

*(Visualization: Input | Inpainting)*
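An illustrative sketch of what a larger `--dilate_kernel_size` does: the foreground mask is grown before inpainting so that shadows and halos around objects are also removed. This mirrors the idea only and is not necessarily `run_inpaint.py`'s exact implementation (the scene name `pool` is a placeholder):

```python
import cv2
import numpy as np

name = "pool"  # placeholder scene name
mask = cv2.imread(f"../data/{name}/mask.png", cv2.IMREAD_GRAYSCALE)

kernel_size = 20  # corresponds to --dilate_kernel_size
kernel = np.ones((kernel_size, kernel_size), np.uint8)
dilated = cv2.dilate((mask > 0).astype(np.uint8) * 255, kernel)

cv2.imwrite("dilated_mask_preview.png", dilated)  # preview of the enlarged inpainting region
```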
- Install requirements:

  ```bash
  pip install openai==0.28 ruamel.yaml
  ```
- Copy the OpenAI API key into `gpt/gpt_configs/my_apikey`.
- Physics reasoning requires the following input for each image:

  ```
  image folder/
  ├── original.png
  ├── mask.png   # movable segmentation mask
  ```
- Run GPT-4V physical property reasoning with the following command:

  ```bash
  python gpt_physic.py --input ../data/${name} --output ../outputs/${name}
  ```
- The output `physics.yaml` contains the physical properties and primitive shape of each object segment in the image. Note that GPT-4V outputs may vary across runs and may differ from the original setting in `data/${name}/sim.yaml`. Users can adjust accordingly for each run's output.
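A minimal sketch for inspecting the reasoned properties before simulation, using the `ruamel.yaml` package installed above; the exact keys come from `gpt_physic.py`'s output, so print the document rather than assuming a schema (the scene name `pool` is a placeholder):

```python
from ruamel.yaml import YAML

name = "pool"  # placeholder scene name
yaml = YAML()
with open(f"../outputs/{name}/physics.yaml") as f:
    physics = yaml.load(f)

print(physics)  # compare against data/${name}/sim.yaml and adjust if needed
```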
- We use Intrinsic to infer the albedo and shading of the input image. Follow the Intrinsic setup to install requirements. We recommend creating a new conda environment.

  ```bash
  git clone https://github.com/compphoto/Intrinsic
  cd Intrinsic/
  git checkout d9741e99b2997e679c4055e7e1f773498b791288
  pip install .
  ```
- Run Intrinsic decomposition on the input image:

  ```bash
  python run_albedo_shading.py --input ../data/${name} --output ../outputs/${name} --vis
  ```
- `shading.npy` is saved in `outputs/${name}`. Visualizations of albedo and shading are saved in `outputs/${name}/intermediate` (see the sketch after this list).

*(Visualization: Input | Albedo | Shading)*
- Intrinsic has released an updated trained model with better results. Feel free to use the updated model or any other model for better performance.
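A minimal sketch for sanity-checking the saved shading, without assuming its exact shape or value range (the scene name `pool` is a placeholder; verify against the actual `run_albedo_shading.py` output):

```python
import numpy as np
from PIL import Image

name = "pool"  # placeholder scene name
shading = np.load(f"../outputs/{name}/shading.npy")
print("shading:", shading.shape, "range:", shading.min(), shading.max())

# Normalize to [0, 255] for a quick visual check.
vis = (shading - shading.min()) / (shading.max() - shading.min() + 1e-8)
Image.fromarray((vis * 255).astype(np.uint8).squeeze()).save("shading_preview.png")
```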