Official Repository for "A Spatial–Temporal Video Quality Assessment Method via Comprehensive HVS Simulation" (Accepted by IEEE Transactions on Cybernetics) [TCYB version]
- python==3.8.8
- torch==1.8.1
- torchvision==0.9.1
- torchsort==0.1.8
- detectron2==0.6
- scikit-video==1.1.11
- scikit-image==0.19.1
- scikit-learn==1.0.2
- scipy==1.8.0
- tensorboardX==2.4.1
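The pip-installable dependencies can be set up as sketched below (adjust the torch/torchvision builds to your CUDA version, and install torch before torchsort, which compiles against it). Note that detectron2==0.6 is not distributed on PyPI; install it following its official instructions.

pip install torch==1.8.1 torchvision==0.9.1
pip install torchsort==0.1.8 scikit-video==1.1.11 scikit-image==0.19.1 scikit-learn==1.0.2 scipy==1.8.0 tensorboardX==2.4.1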
VQA Datasets.
We test HVS-5M on six datasets: KoNViD-1k, CVD2014, LIVE-VQC, LIVE-Qualcomm, YouTube-UGC, and LSVQ. Please download each dataset from its official website.
The content and edge features of the video are extracted by ConvNeXt.
First, download the dataset and set its local path as the videos_dir in CNNfeatures_Spatial.py. Because of the particularities of the LSVQ dataset, we provide a dedicated script, CNNfeatures_Spatial_LSVQ.py, for extracting its spatial features; in it, we mark the video sequence numbers that are missing from the current version of LSVQ.
python CNNfeatures_Spatial.py --database=database --frame_batch_size=16
python CNNfeatures_Spatial_LSVQ.py --database=LSVQ --frame_batch_size=16
Please note that when extracting spatial features, you can choose frame_batch_size according to your GPU memory. After running CNNfeatures_Spatial.py or CNNfeatures_Spatial_LSVQ.py, the spatial features of each video are saved in the directory "/HVS-5M_dataset/SpatialFeature/".
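For reference, the sketch below shows how per-frame content features could be extracted with a ConvNeXt backbone. It is an illustration only: the convnext_tiny variant is loaded via timm (which is not in the dependency list above), and CNNfeatures_Spatial.py additionally extracts edge features and may configure the backbone differently.

```python
# Minimal sketch of per-frame spatial (content) feature extraction.
# Assumptions: timm's convnext_tiny backbone and ImageNet preprocessing;
# the actual CNNfeatures_Spatial.py also extracts edge features.
import torch
import timm
from torchvision import transforms

model = timm.create_model("convnext_tiny", pretrained=True, num_classes=0).eval()
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_spatial_features(frames, frame_batch_size=16):
    """frames: list of HxWx3 uint8 arrays; returns an (N, C) feature tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    feats = []
    for i in range(0, len(batch), frame_batch_size):  # mirrors --frame_batch_size
        feats.append(model(batch[i:i + frame_batch_size]))
    return torch.cat(feats)
```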
The motion features of the video are extracted by SlowFast.
First, you need to download the pre-trained SlowFast model into "./MotionExtractor/checkpoints/Kinetics/".
Similarly, we provide two scripts for extracting temporal features: CNNfeatures_Temporal.py for the other five datasets and CNNfeatures_Temporal_LSVQ.py for LSVQ.
python CNNfeatures_Temporal.py --database=database --frame_batch_size=64
python CNNfeatures_Temporal_LSVQ.py --database=LSVQ --frame_batch_size=64
Please note that frame_batch_size must be 64 when extracting temporal features. After running CNNfeatures_Temporal.py or CNNfeatures_Temporal_LSVQ.py, the temporal features of each video are saved in the directory "/HVS-5M_dataset/TemporalFeature/".
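As background, SlowFast processes each clip through two pathways: a slow pathway that subsamples frames and a fast pathway that keeps the full frame rate. The sketch below illustrates that input packing; the speed ratio alpha=4 and the function name are assumptions, not the repo's exact code (see the MotionExtractor code for the actual pipeline).

```python
# Sketch of SlowFast's dual-pathway input packing (not the repo's exact code).
# Assumption: speed ratio alpha=4, clip tensor shaped (C, T, H, W).
import torch

def pack_pathways(clip, alpha=4):
    """Return [slow, fast] inputs: fast keeps all T frames,
    slow keeps every alpha-th frame (T // alpha frames)."""
    fast = clip
    idx = torch.linspace(0, clip.shape[1] - 1, clip.shape[1] // alpha).long()
    slow = clip.index_select(1, idx)
    return [slow, fast]
```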
The spatial and temporal features are fused to obtain fusion features.
python CNNfeatures_Fusion.py --database=database --frame_batch_size=64
After running CNNfeatures_Fusion.py, the fusion features of each video are saved in the directory "/HVS-5M_dataset/".
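Conceptually, the fusion step can be pictured as combining each video's saved spatial and temporal features along the channel dimension, as in the sketch below. The file name and the concatenation rule are assumptions for illustration; CNNfeatures_Fusion.py defines the actual fusion.

```python
# Hedged sketch of feature fusion by channel-wise concatenation.
# "video_0001.npy" is a hypothetical file name.
import numpy as np

spatial = np.load("HVS-5M_dataset/SpatialFeature/video_0001.npy")    # (N, Cs)
temporal = np.load("HVS-5M_dataset/TemporalFeature/video_0001.npy")  # (N, Ct)
fused = np.concatenate([spatial, temporal], axis=1)                  # (N, Cs + Ct)
np.save("HVS-5M_dataset/video_0001.npy", fused)
```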
python main.py --trained_datasets=K --tested_datasets=K
You can select multiple datasets for training and testing, as in the example below. Specifically, K, C, N, L, Y, and Q represent KoNViD-1k, CVD2014, LIVE-VQC, LIVE-Qualcomm, YouTube-UGC, and LSVQ, respectively.
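For example, to train on KoNViD-1k and CVD2014 while testing on LIVE-VQC (assuming the flags accept concatenated dataset letters, as in BVQA-2022):

python main.py --trained_datasets=KC --tested_datasets=N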
The model trained on the individual dataset KoNViD-1k is used for the test demo; its weights are provided in "models/HVS-5M_K".
python test_demo.py --model_path=models/HVS-5M_K --video_path=data/test.mp4
Table I: Evaluations of HVS-5M on the large-scale dataset LSVQ and cross-dataset results (trained on the LSVQ training set).

| Testing set | SRCC | PLCC |
| :--- | :---: | :---: |
| LSVQ_test | 0.879 | 0.872 |
| LSVQ_1080p | 0.798 | 0.815 |
| KoNViD-1k (cross-dataset) | 0.857 | 0.855 |
| LIVE-VQC (cross-dataset) | 0.810 | 0.832 |
Table II: Fine-tuning results of HVS-5M (pre-trained on LSVQ; the number of videos in each fine-tuning set is given in parentheses).

| Fine-tuning dataset | SRCC | PLCC |
| :--- | :---: | :---: |
| LIVE-VQC (585) | 0.878 | 0.879 |
| KoNViD-1k (1200) | 0.882 | 0.882 |
| YouTube-UGC (1380) | 0.880 | 0.878 |
| Weighted average | 0.880 | 0.880 |
This codebase is heavily inspired by BVQA-2022 (Li et al., TCSVT 2022).
The visual saliency detection and the spatial and temporal feature extraction networks mainly follow the implementations of SAMNet (Liu et al., TIP 2021), ConvNeXt (Liu et al., CVPR 2022), and SlowFast (Feichtenhofer et al., ICCV 2019).
We greatly appreciate their excellent work.
This source code is made available for research purposes only.