Hello, can this method retrieve a video in real time?
The paper says "On YouTube Highlights and TVSum, we obtain clip-level visual features using an I3D [4] pre-trained on Kinetics 400 [13]". Does this mean that, for an unseen test video, the audio and visual features have to be extracted offline separately?
How can the highlighted parts of a video be retrieved in real time?
Thanks a lot.
We use pre-trained (and frozen) feature extractors for both visual and audio features, so the features can be pre-extracted even at the training stage. When testing, you may likewise use these pre-trained expert models to extract the features first, and then feed them into UMT to detect the highlights.
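To make the two-stage pipeline concrete, here is a minimal PyTorch sketch. The modules below are placeholders (plain linear layers) standing in for the frozen I3D/audio experts and the UMT model from this repo; only the structure of the pipeline is the point, not the exact architectures or feature dimensions.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the frozen pre-trained experts and the UMT model.
# In the real pipeline these would be the I3D visual extractor, the audio
# extractor, and the UMT network; linear layers keep this sketch runnable.
visual_extractor = nn.Linear(1024, 512).eval()  # stand-in for frozen I3D
audio_extractor = nn.Linear(128, 512).eval()    # stand-in for frozen audio expert
umt = nn.Linear(1024, 1).eval()                 # stand-in for the UMT detector

@torch.no_grad()
def detect_highlights(visual_clips, audio_clips):
    # Stage 1: pre-extract clip-level features with the frozen experts.
    visual_feats = visual_extractor(visual_clips)           # [num_clips, 512]
    audio_feats = audio_extractor(audio_clips)              # [num_clips, 512]
    # Stage 2: feed the pre-extracted features into the detector.
    fused = torch.cat([visual_feats, audio_feats], dim=-1)  # [num_clips, 1024]
    return umt(fused).squeeze(-1)                           # per-clip scores

# Example: 20 clips of already-pooled visual and audio inputs.
scores = detect_highlights(torch.randn(20, 1024), torch.randn(20, 128))
top3 = scores.topk(3).indices  # indices of the 3 most highlight-like clips
```

Because the extractors are frozen, real-time use is bounded mainly by how fast the expert models can run over incoming clips; UMT itself only consumes the already-extracted features.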
By the way, Equation 1 in the paper looks similar to non-local attention. However, Equation 1 uses the cross product; why not the dot product? What does the cross product mean in Equation 1?
Eq. 1 is exactly the same as non-local attention. The cross-product symbol denotes matrix multiplication between the q and k matrices, i.e. [N_q × d] × [d × N_k] → [N_q × N_k]. This operation is identical to taking dot products along the feature dimension.
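A quick PyTorch sketch (not taken from the repo) showing that the matrix product QKᵀ is just all pairwise dot products along the feature dimension, as in non-local attention:

```python
import torch

N_q, N_k, d = 4, 6, 32
q = torch.randn(N_q, d)  # query features
k = torch.randn(N_k, d)  # key features

# Matrix multiplication: [N_q, d] @ [d, N_k] -> [N_q, N_k]
scores = q @ k.T

# Entry (i, j) is exactly the dot product of query i and key j.
assert torch.allclose(scores[1, 2], torch.dot(q[1], k[2]))

# Non-local / scaled dot-product attention then softmaxes these scores.
attn = torch.softmax(scores / d ** 0.5, dim=-1)  # [N_q, N_k]
```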