
How do I apply to my video files? #1

Closed
rajeevchhabra opened this issue Jan 29, 2018 · 47 comments

Comments

@rajeevchhabra

Hi:
I have been able to run your algorithm on my machine (both training and test datasets). Now I would like to apply it to my own dataset (my videos, which are not packed into .h5). How do I do that? What function would I need to modify? Please guide me.

@KaiyangZhou
Owner

KaiyangZhou commented Jan 29, 2018

Hi @rajeevchhabra,
You can replace this line https://github.com/KaiyangZhou/vsumm-reinforce/blob/master/vsum_train.py#L84 with your own features, which should be of dimension (num_frames, feature_dim). You also need to modify https://github.com/KaiyangZhou/vsumm-reinforce/blob/master/vsum_train.py#L69 to store baseline rewards according to your own datasets.
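
For anyone adapting this, below is a minimal sketch of loading your own features from an .h5 file before plugging them in at that line; the file name and key are hypothetical, not part of this repo.

import h5py
import numpy as np

# 'my_features.h5' and 'video_1/features' are placeholder names
with h5py.File('my_features.h5', 'r') as f:
    features = np.array(f['video_1/features'])  # expected shape: (num_frames, feature_dim)

assert features.ndim == 2, 'features should be a 2D array of shape (num_frames, feature_dim)'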

@zijunwei

zijunwei commented Feb 6, 2018

Hi, is it possible to share the code you used to create the h5 dataset so that I can follow it to create my own?
It doesn't even have to be runnable.
Thanks!

@KaiyangZhou
Owner

Hi @zijunwei,
You can follow the code below to create your own data:

import h5py

h5_file_name = 'blah blah blah'  # path to the output h5 file
f = h5py.File(h5_file_name, 'w')

# video_names is a list of strings containing the
# name of each video, e.g. 'video_1', 'video_2'
for name in video_names:
    # replace each data_of_name below with the actual array/value
    # for that key (see the format described in readme.txt)
    f.create_dataset(name + '/features', data=data_of_name)
    f.create_dataset(name + '/gtscore', data=data_of_name)
    f.create_dataset(name + '/user_summary', data=data_of_name)
    f.create_dataset(name + '/change_points', data=data_of_name)
    f.create_dataset(name + '/n_frame_per_seg', data=data_of_name)
    f.create_dataset(name + '/n_frames', data=data_of_name)
    f.create_dataset(name + '/picks', data=data_of_name)
    f.create_dataset(name + '/n_steps', data=data_of_name)
    f.create_dataset(name + '/gtsummary', data=data_of_name)
    f.create_dataset(name + '/video_name', data=data_of_name)

f.close()

For a detailed description of the data format, please refer to the readme.txt in the datasets folder you downloaded via wget.

Instructions for h5py can be found at http://docs.h5py.org/en/latest/quick.html

Let me know if you have any problems.

@zijunwei

zijunwei commented Feb 6, 2018

Thanks!
Regarding the readme.txt file you referred to:

/key
    /features                 2D-array with shape (n_steps, feature-dimension)
    /gtscore                  1D-array with shape (n_steps), stores ground truth importance score
    /user_summary             2D-array with shape (num_users, n_frames), each row is a binary vector
    /change_points            2D-array with shape (num_segments, 2), each row stores indices of a segment
    /n_frame_per_seg          1D-array with shape (num_segments), indicates number of frames in each segment
    /n_frames                 number of frames in original video
    /picks                    positions of subsampled frames in original video
    /n_steps                  number of subsampled frames
    /gtsummary                1D-array with shape (n_steps), ground truth summary provided by user
    /video_name (optional)    original video name, only available for SumMe dataset

How is gtscore computed, and how is it different from gtsummary or the average of user_summary?
I didn't see you using gtscore or gtsummary in testing; I'm just asking out of curiosity.
Thanks!

@KaiyangZhou
Owner

gtscore and gtsummary are used for training only. I should have clarified this.

gtscore is the average of multiple importance scores (used by regression loss). gtsummary is a binary vector indicating indices of keyframes, and is provided by original datasets as well (this label can be used for maximum likelihood loss).
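
A rough sketch of what "average of multiple importance scores" means in practice (the variable names and the binarization step below are illustrative assumptions, not the exact procedure used by the original datasets):

import numpy as np

# hypothetical per-user annotations: one row of importance scores per annotator
user_scores = np.random.rand(5, 300)   # shape (num_users, n_steps)

gtscore = user_scores.mean(axis=0)     # (n_steps,) average importance per subsampled frame

# gtsummary is a binary keyframe indicator provided by the original datasets;
# thresholding the average is only a stand-in here, not how they define it
gtsummary = (gtscore >= np.median(gtscore)).astype(np.float32)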

user_summary contains multiple key-clips given by human annotators and we need to compare our machine summary with each one of the user summaries.

Hope this clarifies.

@zijunwei

Thanks! It's very helpful!

@chandra-siri

chandra-siri commented Mar 30, 2018

@KaiyangZhou I'm trying to create an .h5 file for my own video.
After reading 'datasets/readme.txt' I understood that I need data such as features, n_frames, picks, n_steps. (I could only understand what n_frames is :| )

But what exactly are features? I understand it'll be a numpy matrix of shape (n_steps, feature_dimension). But what are these, and how do I extract them for the frames of a given video? Could you please give me more description about them?

I've glanced through your paper, but I couldn't find anything about these.

@KaiyangZhou
Owner

Hi @chandra-siri,

features contains feature vectors representing video frames. Each video frame can be represented by a feature vector (carrying some semantic meaning) extracted by a pretrained convolutional neural network (e.g. GoogLeNet). picks is an array storing the positions of the subsampled video frames. We do not process every video frame, since adjacent frames are very similar. We can subsample a video at 2 frames per second or 1 frame per second, which results in fewer frames that are still informative. picks is useful when we want to map results on the subsampled frames back to the original video: say you have obtained importance scores for the subsampled frames and you want scores for the entire video, picks indicates which frames were scored, and the scores of the surrounding frames can be filled in from them.
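
A small sketch of how picks can be built by uniform subsampling and then used to spread the subsampled scores back onto every frame (the numbers and names are illustrative, not the exact procedure behind the released datasets):

import numpy as np

n_frames = 3000                         # frames in the original video (example number)
stride = 15                             # keep 1 of every 15 frames (~2 fps for a 30 fps video)
picks = np.arange(0, n_frames, stride)  # positions of the subsampled frames

scores = np.random.rand(len(picks))     # importance scores for the subsampled frames only

# fill each original frame with the score of its nearest preceding subsampled frame
frame_scores = np.zeros(n_frames)
for i, pos in enumerate(picks):
    end = picks[i + 1] if i + 1 < len(picks) else n_frames
    frame_scores[pos:end] = scores[i]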

how do I extract them for the frames of a given video?

You can use off-the-shelf feature extractors to achieve this, e.g. in PyTorch. First, load the feature extractor, e.g. a pretrained neural network. Second, loop over each video frame and use the feature extractor to extract features from it. Each frame will be represented by a long feature vector; if you use GoogLeNet, you will end up with a 1024-dimensional feature vector. Third, stack the extracted features to form a feature matrix, and save it to the h5 file as specified in the readme.txt.

The pseudocode below might make this clearer:

import numpy as np

features = []
for frame in video_frames:
    # frame is a numpy array of shape (channel, height, width)
    # do some preprocessing such as normalization
    frame = preprocess(frame)
    # apply the feature extractor to this frame
    feature = feature_extractor(frame)
    # save the feature
    features.append(feature)
features = np.vstack(features)  # now shape is (n_steps, feature_dimension)
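
For a more concrete (but unofficial) version of the loop above, here is a sketch using torchvision's pretrained GoogLeNet; frame_paths is a hypothetical list of frame image paths, and the preprocessing simply follows the standard ImageNet recipe rather than the exact pipeline used for the released datasets:

import numpy as np
import torch
import torchvision.transforms as T
from torchvision import models
from PIL import Image

model = models.googlenet(pretrained=True)
model.fc = torch.nn.Identity()  # drop the classifier so the output is the 1024-dim feature
model.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = []
with torch.no_grad():
    for path in frame_paths:                 # hypothetical list of (subsampled) frame image paths
        img = Image.open(path).convert('RGB')
        x = preprocess(img).unsqueeze(0)     # shape (1, 3, 224, 224)
        feat = model(x).squeeze(0)           # shape (1024,)
        features.append(feat.numpy())

features = np.stack(features)                # shape (n_steps, 1024)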

Hope this helps.

@chandra-siri

@KaiyangZhou This is very informative and helpful. I'll try out what you've mentioned, using GoogLeNet (Inception v3), and let you know. Thanks a lot!

@chandra-siri

@KaiyangZhou As you suggested, I was able to extract the frames. But in order to get a summary I also need change_points. Could you tell me what change_points is, and also what num_segments is?

@KaiyangZhou
Owner

@chandra-siri
change_points corresponds to shot transitions, which are obtained by temporal segmentation approaches that segment a video into disjoint shots. num_segments is the total number of segments a video is cut into. Please refer to this paper and this paper if you are unfamiliar with the pipeline.

Specifically, change_points look like

change_points = [
    [0, 10],
    [11, 20],
    [21, 30],
]

This means the video is segmented into three parts: the first part ranges from frame 0 to frame 10, the second from frame 11 to frame 20, and so on.
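
As a small follow-up sketch, n_frame_per_seg and num_segments can be derived directly from change_points (assuming, as in the example above, that both the start and end indices of each segment are inclusive):

import numpy as np

change_points = np.array([[0, 10], [11, 20], [21, 30]])

n_frame_per_seg = change_points[:, 1] - change_points[:, 0] + 1  # -> [11, 10, 10]
num_segments = len(change_points)                                # -> 3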

@samrat1997

How do I know which key in the dataset corresponds to which video in the SumMe dataset?

@KaiyangZhou
Owner

@samrat1997
SumMe: the video name is stored in video_i/video_name.
TVSum: video_1 to video_50 follow the same order as in ydata-tvsum50.mat, the original MATLAB file provided by TVSum.

@samrat1997

@KaiyangZhou ... Thank you. I just realized that.

@harora

harora commented Jun 1, 2018

@KaiyangZhou Hi. I've been trying to use the code to test on my dataset. I used the pretrained Google Inception v3 PyTorch model to generate features, and its output has 1000 classes, hence my features' shape is (num_frames, 1000). However, the dataset used here has 1024-dimensional features. Can you help with this? Will I have to modify and retrain the Inception model?

@KaiyangZhou
Owner

@harora the feature dimension does not matter; you can feed (num_frames, any_feature_dim) to the algorithm, and you don't need to retrain the model.

It is strange to use the class logits as feature vectors, though; it would make more sense to use the layer before the softmax, e.g. 1024-dim for GoogLeNet or 2048-dim for ResNet.
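
For example, with a torchvision ResNet-50, one simple way (among others) to grab the 2048-dim pre-softmax features is to replace the final fully connected layer; this is only a sketch, not the pipeline used for the released datasets:

import torch
from torchvision import models

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()     # output becomes the 2048-dim pooled feature, not class logits
model.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed frame
    feature = model(dummy)               # shape (1, 2048)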

@Petersteve

@KaiyangZhou hi,
Are the change_points generated manually? If not, could you point me to the associated code?

Is gtscore generated manually by the users? If not, could you point me to the associated code?

@KaiyangZhou
Owner

@bersalimahmoud change_points are obtained by a temporal segmentation method. gtscore is the average of the human-annotated scores, so it can be used for supervised training (you won't need it anyway).

@liuhaixiachina

liuhaixiachina commented Jun 18, 2018

@KaiyangZhou
Regarding "Visualize summary", the readme says:


Visualize summary

You can use summary2video.py to transform the binary machine_summary to real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video


Where or how can I get the frames?

Can I get frames from the .h5 files, or shall I create frames from the raw videos?

Thank you very much !

@KaiyangZhou
Owner

@liuhaixiachina you need to decompose a video into frames before doing other things, e.g. feature extraction. You can use ffmpeg or Python (e.g. OpenCV) to do it.
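
For reference, a minimal OpenCV sketch for dumping frames to a directory (the paths and file-naming pattern are placeholders; match whatever naming summary2video.py expects on your side):

import os
import cv2

video_path = 'my_video.mp4'   # placeholder path
out_dir = 'frames/video_1'    # placeholder directory
os.makedirs(out_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imwrite(os.path.join(out_dir, '{:06d}.jpg'.format(idx)), frame)
    idx += 1
cap.release()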

@babyjie57

@KaiyangZhou Hi, I am trying to use the code to test on my own video. I used a pretrained model to generate features, and its output is 4096-dimensional. I saw you say above that "the feature dimension does not matter", but I got "RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 4096".

Could you please tell me how to solve this issue?

Thanks a lot!

@KaiyangZhou
Owner

@babyjie57 you need to change the argument to input_dim=4096

@babyjie57

babyjie57 commented Jun 20, 2018

@KaiyangZhou Thanks for your reply. I also added '--input-dim 4096', but I got 'While copying the parameter named "rnn.weight_ih_l0_reverse", whose dimensions in the model are torch.Size([1024, 4096]) and whose dimensions in the checkpoint are torch.Size([1024, 1024]).'

Can you please tell me how to solve this issue?

Thanks!

@KaiyangZhou
Owner

I presume you are loading a model which was trained with features of 1024 dimensions but initialized with feature dimension = 4096.

@mctigger

Can you also publish the script for the KTS you used to generate the change points?

@KaiyangZhou
Owner

@mctigger you can find the code here http://lear.inrialpes.fr/people/potapov/med_summaries.php

@HrsPythonix

HrsPythonix commented Sep 21, 2018

I have a question: if I want to use my own dataset but there are no labels in it, what should I do with user_summary, gtscore and gtsummary when I construct the hdf5 file?

Also, I see these three labels are only used in the evaluation process; does this mean that I can just delete them both from the hdf5 file and from the evaluation function? (I mean in the PyTorch implementation.)

Moreover, if I want to use the result.json to generate a summary video for a raw video, can I delete these three labels?

@MuziSakura

Have you solved this problem? I want to use my own video data but I don't know how to deal with user_summary, gtscore and gtsummary.

@anuragshas

How did you convert the video into a signal for Kernel Temporal Segmentation (KTS)?

@gh2517956473

@KaiyangZhou
How did you convert the video into a signal for Kernel Temporal Segmentation (KTS)?
Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second?
Could you please share your code with me?
Thank you very much!

@KaiyangZhou
Owner

How did you convert the video into a signal for Kernel Temporal Segmentation (KTS)?

You can decompose a video using either ffmpeg or OpenCV. For the latter, there is example code on the OpenCV website. You can write something like

import cv2

cap = cv2.VideoCapture('my_video.mp4')  # path to your video (0 would open the webcam instead)
video_features = []

while True:
    # Capture frame-by-frame
    ret, frame = cap.read()
    if not ret:
        # no more frames
        break
    # maybe skip some frames here for downsampling
    # feature extraction
    feature = feature_extractor(frame)  # or run on a minibatch to leverage the GPU
    # store the feature
    video_features.append(feature)

cap.release()
summary = video_summarizer(video_features)

Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second?

Yes. You can use CNN features, which capture high-level semantics. Downsampling is a common technique, as neighbouring frames are redundant; 2 fps or 1 fps is good.

@KaiyangZhou
Owner

Annotations are not required for training; only frame features are required by the algorithm. You can qualitatively evaluate the results by applying the trained model to unseen videos and watching the summaries.

@gh2517956473

@KaiyangZhou
Could you tell me where to download the original videos (SumMe and TVSum)?
Thank you very much !

@loveFaFa

Could you tell me where to download the original videos (SumMe and TVSum)?
Thank you very much !

@bdgp01

bdgp01 commented Mar 25, 2019

Same question as above.

Could you please tell me where to download the original videos (SumMe and TVSum)?
Thank you very much in advance !

@rajlakshmi123

@KaiyangZhou
Can you please tell me what picks is and how we can calculate it? What are its dimensions?

@wjb123

wjb123 commented May 30, 2019

@KaiyangZhou, how do you use KTS to generate change points? I used the official KTS code with CNN features for each frame, but I get the same number of segments for every video. Is there any problem?

@SinDongHwan

@KaiyangZhou To get change points, should the frames of a video be fed as X in "demo.py", or should the features of each frame?

@chenchch94

@wjb123 Hi, did you solve this problem? → "How do you use KTS to generate change points? I used the official KTS code with CNN features for each frame, but I get the same number of segments for every video. Is there any problem?"

@SinDongHwan

@chenchch94
This is my forked repository.
You can find generate_dataset.py in the "utils" directory.
Good luck!

@neda60

neda60 commented Mar 18, 2020

Hi,
I could generate the .h5 file for my own dataset; however, my dataset has no annotations. Is it possible to use your code without annotated videos? If so, how?
Thanks!

@SinDongHwan

Hi @neda60,
For training and evaluation, it's not possible.
But just for testing, it is possible.
To test, you need "features", "picks", "n_frames", "change_points", and "n_frame_per_seg".
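
As a sketch, a test-only h5 file could then look like this (the arrays are placeholders you fill in yourself; the file name is hypothetical and the shapes follow the readme.txt quoted earlier in this thread):

import h5py

with h5py.File('my_test_videos.h5', 'w') as f:                  # placeholder file name
    g = f.create_group('video_1')
    g.create_dataset('features', data=features)                 # (n_steps, feature_dim)
    g.create_dataset('picks', data=picks)                       # (n_steps,)
    g.create_dataset('n_frames', data=n_frames)                 # scalar
    g.create_dataset('change_points', data=change_points)       # (num_segments, 2)
    g.create_dataset('n_frame_per_seg', data=n_frame_per_seg)   # (num_segments,)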

@anaghazachariah

@KaiyangZhou Can you please share the code for creating the .h5 file? How should I deal with gtscore, gtsummary and user_summary?

@vb637

vb637 commented Nov 16, 2020

Hello, I want to use your RL code to extract key frames. I now use a complex network to extract features and store them in an .h5 file, but I don't have the other attributes such as gtscore and gtsummary (I guess a dataset needs at least these attributes). For now I create gtscore as an all-ones numpy array, but I don't know whether this is right or wrong. If it is wrong, how can I compute gtscore? Meanwhile, I create gtsummary by randomly sampling some frames; should I sample uniformly?

@Fredham

Fredham commented Jun 27, 2022

@liuhaixiachina you need to decompose a video before doing other things e.g. feature extraction. You can use ffmpeg or python to do it.

I followed the steps mentioned in the README, but it doesn't produce video frames either. Shall I create the frames from the raw videos? Are there any missing steps in the README? How could I decompose a video using ffmpeg or Python, given that there are no videos in the datasets? I also read the code of summary2video.py. Should I decompose "result.h5"?

@Fredham

Fredham commented Jun 27, 2022

Hi: I have been able to run your algorithm on my machine (both training and test datasets). Now I would like to apply it to my dataset (my videos - they are not compressed to .h5). How do I do that? What function would I need to modify? Please guide.
in readme, it says:
Visualize summary
You can use summary2video.py to transform the binary machine_summary to real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video

Where or how can I get the frames? I followed the steps mentioned in the README, but it doesn't produce video frames either. Shall I create the frames from the raw videos? Are there any missing steps in the README? How could I decompose a video using ffmpeg or Python, given that there are no videos in the datasets? I also read the code of summary2video.py. Should I decompose "result.h5"?

@Fredham

Fredham commented Jun 27, 2022

@KaiyangZhou How did you convert the video into a signal for Kernel Temporal Segmentation (KTS)? Did you use CNN features for KTS? Are the CNN features subsampled and extracted from the video at 2 frames per second or 1 frame per second? Could you please share your code with me? Thank you very much!

Visualize summary
You can use summary2video.py to transform the binary machine_summary to real summary video. You need to have a directory containing video frames. The code will automatically write summary frames to a video where the frame rate can be controlled. Use the following command to generate a .mp4 video

Where or how can I get the frames? I followed the steps mentioned in the README, but it doesn't produce video frames either. Shall I create the frames from the raw videos? Are there any missing steps in the README? How could I decompose a video using ffmpeg or Python, given that there are no videos in the datasets? I also read the code of summary2video.py. Should I decompose "result.h5"?
