Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions
This repository contains the code for our ICCV 2021 paper, Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions.
Dataset webpage: https://shuangli-project.github.io/VHICO-Dataset/
Project webpage: https://shuangli-project.github.io/weakly-supervised-human-object-detection-video/
This project addresses weakly supervised human-object interaction detection in videos. We introduce a contrastive weakly supervised training loss that jointly associates spatiotemporal regions in a video with an action and object vocabulary, and encourages temporal continuity of the visual appearance of moving objects as a form of self-supervision.
To train our model, we introduce a dataset comprising over 6.5k videos with human-object interaction annotations that have been semi-automatically curated from sentence captions associated with the videos.
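For intuition, here is a minimal PyTorch sketch of the two training signals described above: a contrastive loss that associates region features with an action/object vocabulary, and a temporal-continuity term between matched regions in adjacent frames. All names and the exact formulation are illustrative assumptions, not the paper's code.

import torch
import torch.nn.functional as F

def contrastive_association_loss(region_feats, label_embs, pos_label_idx, tau=0.1):
    # region_feats: (R, D) features of candidate spatiotemporal regions in one video.
    # label_embs:   (V, D) embeddings of the action/object vocabulary.
    # pos_label_idx: vocabulary index of the video-level label (the weak supervision).
    region_feats = F.normalize(region_feats, dim=-1)
    label_embs = F.normalize(label_embs, dim=-1)
    sims = region_feats @ label_embs.t() / tau   # (R, V) region-label similarities
    video_scores = sims.logsumexp(dim=0)         # soft pooling over regions -> (V,)
    # The video's ground-truth label competes against the rest of the vocabulary.
    target = torch.tensor([pos_label_idx])
    return F.cross_entropy(video_scores.unsqueeze(0), target)

def temporal_continuity_loss(feats_t, feats_t1):
    # feats_t, feats_t1: (R, D) features of regions matched across adjacent frames.
    # Pulls the appearance of the same moving object together over time.
    return (1.0 - F.cosine_similarity(feats_t, feats_t1, dim=-1)).mean()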
Installation:
conda install pytorch=0.4.1 cuda90 -c pytorch
pip install cython
pip install numpy scipy pyyaml packaging pycocotools tensorboardX tqdm scikit-image gensim
pip install opencv-python
pip uninstall matplotlib
conda install -c conda-forge matplotlib
pip uninstall pillow
conda install -c anaconda pil
Due to license issues, please download the corresponding videos from the Moments in Time dataset yourself.
The data we use comes from their extracted frames, under the folder name video_256_30fps.
For more information about our dataset, please visit the dataset website.
Please download the human annotations and saved results first. Unzip the human annotations into the data folder, and unzip the saved results into the results folder.
Evaluation on the test set:
mAP: python eval/eval_vhico.py --eval_subset test --EVAL_MAP 1
Recall: python eval/eval_vhico.py --eval_subset test --EVAL_MAP 0

Evaluation on the unseen set:
mAP: python eval/eval_vhico.py --eval_subset unseen --EVAL_MAP 1
Recall: python eval/eval_vhico.py --eval_subset unseen --EVAL_MAP 0
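For reference, the sketch below shows how average precision is conventionally computed over a ranked list of detections for one class. It illustrates the standard detection-mAP recipe, not necessarily the exact protocol implemented in eval_vhico.py.

import numpy as np

def average_precision(scores, is_correct, num_gt):
    # scores:     confidence of each detection for one class.
    # is_correct: 1 if the detection matches a ground-truth instance (e.g. IoU >= 0.5).
    # num_gt:     total number of ground-truth instances for this class.
    order = np.argsort(-np.asarray(scores, dtype=float))
    hits = np.asarray(is_correct, dtype=float)[order]
    precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    # Average the precision values at each correctly retrieved detection.
    return float((precision * hits).sum() / max(num_gt, 1))

# e.g. average_precision([0.9, 0.6, 0.3], [1, 0, 1], num_gt=2) -> ~0.83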
Before training, please run the following preprocessing steps:
- DensePose, to extract human segmentation masks of the video frames.
- word2vec, to extract features of the action and object labels (see the sketch after this list).
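A minimal sketch of the word2vec step, assuming gensim (installed above) and a pretrained embedding file such as GoogleNews-vectors-negative300.bin; the file path and the word-averaging scheme are illustrative assumptions, not this repo's exact code.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path: any pretrained word2vec binary works the same way.
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def label_feature(label):
    # Average the word vectors of a (possibly multi-word) action or object label.
    words = [w for w in label.lower().split() if w in w2v]
    if not words:
        return np.zeros(w2v.vector_size)
    return np.mean([w2v[w] for w in words], axis=0)

# e.g. label_feature('riding horse') -> 300-d feature vector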
Then train and test our model:
sh scripts/train_rel_mit.sh
sh scripts/test_rel_mit.sh