DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

[Project Page] [arXiv]


Usage

Setup
Preprocessing
Training
Inference

Setup

Clone the repository:

git clone https://github.com/AssafSinger94/dino-tracker.git

Switch to the project directory:

cd dino-tracker

To set up the environment, run:

conda create -n dino-tracker python=3.9
conda activate dino-tracker
pip install -r requirements.txt

Add the current directory to PYTHONPATH:

export PYTHONPATH=`pwd`:$PYTHONPATH

Preprocessing

Given an input video, we start by extracting optical flow and DINO best-buddy correspondences. The input video directory should have the following structure:

├──<VIDEO_DIR>
    ├──video/
        ├──00000.png
        ├──00001.png
        ├──...
    ├──masks/ # optional
        ├──00000.png
        ├──00001.png
        ├──...

where masks/ contains the per-frame foreground masks. If masks/ is not provided, foreground masks are computed automatically from DINO feature saliency maps.
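Before running preprocessing, it can help to confirm that a directory matches the layout above. The following is a minimal sketch, not part of the repository: the function name `check_video_dir` is hypothetical, and it assumes that frames are numbered consecutively from 00000.png and that masks, when present, use the same file names as the frames.

```python
from pathlib import Path


def check_video_dir(video_dir):
    """Return (num_frames, has_masks) for a DINO-Tracker input directory.

    Raises ValueError if the layout does not match the structure above.
    """
    root = Path(video_dir)
    frames = sorted(p.name for p in (root / "video").glob("*.png"))
    if not frames:
        raise ValueError(f"no PNG frames found in {root / 'video'}")

    # Assumption: frames are named 00000.png, 00001.png, ... with no gaps.
    expected = [f"{i:05d}.png" for i in range(len(frames))]
    if frames != expected:
        raise ValueError("frames are not consecutively numbered from 00000.png")

    # masks/ is optional; when present, assume one mask per frame.
    masks_dir = root / "masks"
    has_masks = masks_dir.is_dir()
    if has_masks:
        masks = sorted(p.name for p in masks_dir.glob("*.png"))
        if masks != frames:
            raise ValueError("masks/ does not contain one mask per frame")
    return len(frames), has_masks
```

Calling `check_video_dir("<VIDEO_DIR_PATH>")` returns the frame count and whether masks were found, or raises with a message describing the mismatch.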

If the video is an mp4 file, convert it to frames by running:

python ./preprocessing/mp4_to_frames.py \
    --video-path <PATH_TO_MP4> \
    --output-folder <VIDEO_DIR_PATH>/video

Then run the preprocessing pipeline:

python ./preprocessing/main_preprocessing.py \
    --config ./config/preprocessing.yaml \
    --data-path <VIDEO_DIR_PATH>

The script outputs chained optical flow trajectories, DINO embeddings and DINO best-buddies in the following structure:

├──<VIDEO_DIR>
    ├──video/
    ├──masks/
    ├──dino_best_buddies/
    ├──dino_embeddings/
    ├──of_trajectories/
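A quick way to confirm that preprocessing finished is to check for the three output folders listed above. This is a minimal sketch under the assumption that each output is a non-empty folder after a successful run; the helper name `missing_preprocessing_outputs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Output folders listed in the structure above.
EXPECTED_OUTPUTS = ("dino_best_buddies", "dino_embeddings", "of_trajectories")


def missing_preprocessing_outputs(video_dir):
    """Return the expected output folders that are absent or empty."""
    root = Path(video_dir)
    missing = []
    for name in EXPECTED_OUTPUTS:
        folder = root / name
        # Assumption: a successful run leaves at least one file in each folder.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing
```

An empty list means all three folders exist and contain at least one entry.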

Training

Once preprocessing is finished, run the following command to train DINO-Tracker:

python ./train.py \
    --config ./config/train.yaml \
    --data-path <VIDEO_DIR_PATH>

The checkpoints are saved under:

├──<VIDEO_DIR>
    ├──models
        ├──dino_tracker
            ├──delta_dino_<ITER>.pt
            ├──tracker_head_<ITER>.pt
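Since checkpoints are saved per iteration, a small helper can locate the most recent pair. The sketch below assumes that `<ITER>` is a non-negative integer and that a usable checkpoint consists of both files for the same iteration; the function name `latest_checkpoints` is hypothetical.

```python
import re
from pathlib import Path

# Matches delta_dino_<ITER>.pt and tracker_head_<ITER>.pt (ITER assumed numeric).
CKPT_PATTERN = re.compile(r"^(delta_dino|tracker_head)_(\d+)\.pt$")


def latest_checkpoints(video_dir):
    """Return (iteration, delta_dino_path, tracker_head_path) or None.

    Only iterations for which both files exist are considered.
    """
    ckpt_dir = Path(video_dir) / "models" / "dino_tracker"
    by_iter = {}
    for path in ckpt_dir.glob("*.pt"):
        match = CKPT_PATTERN.match(path.name)
        if match:
            kind, iteration = match.group(1), int(match.group(2))
            by_iter.setdefault(iteration, {})[kind] = path
    complete = [i for i, files in by_iter.items() if len(files) == 2]
    if not complete:
        return None
    newest = max(complete)
    return newest, by_iter[newest]["delta_dino"], by_iter[newest]["tracker_head"]
```

Comparing iterations as integers avoids the lexicographic pitfall where `delta_dino_900.pt` would sort after `delta_dino_10000.pt`.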

Inference

Trajectory creation and visualization

To predict and visualize trajectories with a trained DINO-Tracker, run the following scripts sequentially:

python ./inference_grid.py \
    --config ./config/train.yaml \
    --data-path <VIDEO_DIR_PATH> \
    --use-segm-mask # optional, used for sampling only foreground points

python visualization/visualize_rainbow.py \
    --data-path <VIDEO_DIR_PATH> \
    --plot-trails # optional, used for visualizing motion trails.

The first script creates trajectories for a grid of query points in the first frame, and the second script visualizes them. The --plot-trails option draws motion trails and requires a segmentation mask for the first frame. Without --plot-trails, the script only draws the tracked positions as circles. The visualizations are written to the <VIDEO_DIR_PATH>/visualizations directory.
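To illustrate what a grid of query points in the first frame can look like, here is a minimal sketch. It is not the sampling code from inference_grid.py, whose details are not described here: the function name, the `stride` parameter, and the half-stride offset are all assumptions made for illustration. The optional `mask` argument mimics the effect of --use-segm-mask by keeping only foreground points.

```python
def grid_query_points(height, width, stride, mask=None):
    """Return (x, y) query points on a regular grid in the first frame.

    mask, if given, is a height x width nested list of 0/1 values;
    only points that fall on foreground (non-zero) pixels are kept.
    """
    points = []
    # Start half a stride in so that no point sits on the top or left border.
    for y in range(stride // 2, height, stride):
        for x in range(stride // 2, width, stride):
            if mask is None or mask[y][x]:
                points.append((x, y))
    return points
```

A smaller stride gives denser trajectories at the cost of more points to track.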

TAP-Vid evaluation

To evaluate on TAP-Vid-DAVIS, follow the steps below. The same steps apply to the TAP-Vid-Kinetics and BADJA datasets.

Download the benchmark data file tapvid_davis_data_strided.pkl from this link and place it at ./tapvid/tapvid_davis_data_strided.pkl.
Download the pre-trained weights and videos (davis_480.zip) from this link and unzip the folder to ./dataset/tapvid-davis/.
Extract DINO embeddings for all videos by running the following:

python ./preprocessing/save_dino_embed_video.py \
    --config ./config/preprocessing.yaml \
    --data-path ./dataset/tapvid-davis/<VIDEO_ID>

Run the command above for every video in the benchmark, e.g. <VIDEO_ID> = {0, 1, ..., 29} for DAVIS.

Predict trajectories on benchmark query points by running the following for all benchmark videos:

python inference_benchmark.py \
    --config ./config/train.yaml \
    --data-path ./dataset/tapvid-davis/<VIDEO_ID> \
    --benchmark-pickle-path ./tapvid/tapvid_davis_data_strided.pkl \
    --video-id <VIDEO_ID>
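The two per-video steps above (embedding extraction and benchmark inference) have to be repeated for every video ID, so it can be convenient to generate the commands in a loop. The sketch below only builds the argument lists, using the script paths and flags exactly as shown above; the helper names are hypothetical, and the default of 30 videos corresponds to the DAVIS IDs 0 to 29 mentioned earlier.

```python
DATASET_ROOT = "./dataset/tapvid-davis"
BENCHMARK_PICKLE = "./tapvid/tapvid_davis_data_strided.pkl"


def benchmark_commands(video_id):
    """Return the embedding and inference commands for one benchmark video."""
    data_path = f"{DATASET_ROOT}/{video_id}"
    embed = [
        "python", "./preprocessing/save_dino_embed_video.py",
        "--config", "./config/preprocessing.yaml",
        "--data-path", data_path,
    ]
    infer = [
        "python", "inference_benchmark.py",
        "--config", "./config/train.yaml",
        "--data-path", data_path,
        "--benchmark-pickle-path", BENCHMARK_PICKLE,
        "--video-id", str(video_id),
    ]
    return embed, infer


def all_benchmark_commands(num_videos=30):
    """Return all commands for video IDs 0 .. num_videos - 1.

    For each video, the embedding command comes before the inference command.
    """
    commands = []
    for video_id in range(num_videos):
        commands.extend(benchmark_commands(video_id))
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which stops at the first failing video instead of continuing silently.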

Evaluate the model accuracy by running the following:

python ./eval/eval_benchmark.py \
    --dataset-root-dir ./dataset/tapvid-davis \
    --benchmark-pickle-path ./tapvid/tapvid_davis_data_strided.pkl \
    --out-file ./tapvid/comp_metrics_davis.csv \
    --dataset-type tapvid # tapvid | BADJA

The evaluation should output: average_pts_within_thresh: 0.8066 | occlusion_acc: 0.8854 | average_jaccard: 0.6528.

The output CSV file contains all TAP-Vid metrics (position accuracy, occlusion accuracy, Average Jaccard) for all videos.
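To get a quick overview of the per-video results, the CSV can be averaged column by column. This is a minimal sketch that assumes the file has a header row and one row per video; it makes no assumption about the column names and simply skips any column that is not fully numeric. Whether these per-video means match the summary line printed by the evaluation script depends on how that script aggregates, which is not stated here.

```python
import csv


def mean_numeric_columns(csv_path):
    """Average every column whose values all parse as floats."""
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    means = {}
    if not rows:
        return means
    for column in rows[0]:
        try:
            values = [float(row[column]) for row in rows]
        except (TypeError, ValueError):
            continue  # skip non-numeric columns, e.g. a video identifier
        means[column] = sum(values) / len(values)
    return means
```

For example, `mean_numeric_columns("./tapvid/comp_metrics_davis.csv")` returns a dictionary mapping each numeric column to its mean over all rows.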

Citation

@misc{dino_tracker_2024,
    author        = {Tumanyan, Narek and Singer, Assaf and Bagon, Shai and Dekel, Tali},
    title         = {DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video},
    month         = {March},
    year          = {2024},
    eprint        = {2403.14548},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}
