NetVLAD++ Model RAM consumption? #35
Closed · Wilann opened this issue Dec 6, 2021 · 17 comments

@Wilann commented Dec 6, 2021

Hi SoccerNet Dev Team,

I've managed to plug my own dataset into NetVLAD++, but I'm unable to train because it overloads my 32 GB of RAM.

I have ~80 badminton matches of ~50 minutes each, with ResNet-152 features sampled at 5 fps. After loading my dataset, ~18/32 GB of RAM is used. The program gets killed while loading the model. I'm confused why, as it's only ~5.5 GB as shown in the TorchInfo summary below, so I believe I should still have ~8 GB to spare. Is this a feature of NetVLAD++ specifically? I noticed that in #28 experiments were done with 60-90 GB of RAM.

Thank you for reading, and looking forward to your insights!

TorchInfo Summary:

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
NetVLAD_plus_plus                        --                        --
├─Linear: 1-1                            [5236, 512]               1,049,088
├─NetVLAD: 1-2                           [44, 14336]               28,672
├─NetVLAD: 1-3                           [44, 14336]               28,672
├─Dropout: 1-4                           [44, 28672]               --
├─Linear: 1-5                            [44, 3]                   86,019
├─Sigmoid: 1-6                           [44, 3]                   --
==========================================================================================
Total params: 1,192,451
Trainable params: 1,192,451
Non-trainable params: 0
Total mult-adds (G): 5.50
==========================================================================================
Input size (MB): 42.89
Forward/backward pass size (MB): 31.54
Params size (MB): 4.77
Estimated Total Size (MB): 79.20
==========================================================================================
@SilvioGiancola (Owner) commented Dec 6, 2021

Hi @Wilann, NetVLAD++ uses a batch size of 256 by default. Maybe you can try a smaller batch size:

parser.add_argument('--batch_size', required=False, type=int, default=256, help='Batch size' )

If your issue comes from the inference (or evaluation), which takes a full video as input, you might want to change the parameter BS, which is the inference batch size (the number of windows inferred at once).
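For example, a run with a smaller batch size might look like this (the entry-point script name here is a placeholder for whatever you already run; only --batch_size is the actual flag above):

python src/main.py --batch_size 32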

@Wilann (Author) commented Dec 8, 2021

Hi @SilvioGiancola, thank you for your response! I tried with a batch size of 1, and the program still crashes after loading the 300 training games. Would you happen to know the minimum RAM required to run the code?

Here's a screenshot:
Screenshot from 2021-12-08 16-10-42

@SilvioGiancola (Owner):

Hi @Wilann, you are right: this implementation of NetVLAD++ actually pre-loads all the features into RAM for faster training.
A quick workaround for smaller RAM would be to train on a subset of your dataset. I see you are training on the 300 games from SoccerNet; you should be able to train on fewer videos. You said you only have 80 games in your dataset? Do you have the same issue when you train on fewer games?

@Wilann (Author) commented Dec 10, 2021

Hi @SilvioGiancola, I actually managed to install more RAM in my PC, and NetVLAD++ is able to train on SoccerNet now (with all 500 games, using ~55 GB of RAM). For my own dataset I indeed have 80 games, and training works on that dataset as well (~20 GB of RAM used).

I also computed ResNet-152 features at 5 fps and would like to train on those, but my GPU RAM gets overloaded. I tried lowering BS to 1 as you suggested above, but it seemed to make no difference. Do you have any suggestions on how I can optimize the code for 5 fps?

Also, once the code reaches the dataloaders built on SoccerNetClipsTesting, I get a spike in GPU RAM. I set up my dataloaders something like this (to try to replicate CALF's setup):

dataset_train = SoccerNetClips(...)
dataset_valid = SoccerNetClips(...)
dataset_valid_metric = SoccerNetClipsTesting(...)
dataset_test = SoccerNetClipsTesting(...)

But I noticed you used SoccerNetClips for dataset_valid_metric. With my change, I notice my GPU RAM spikes on my custom dataset, but not on SoccerNet. Do you have an idea why?

Thank you for your response - really appreciate it.

A visual on the spike with my changes made:
Screenshot from 2021-12-10 07-54-28

@SilvioGiancola (Owner) commented Dec 12, 2021

Hi @Wilann,
Increasing the fps from 2 to 5 will incur +150% memory usage (2.5× the features). Your only solution would be to change dataset.py.

At line 72, I am loading all the features into non-overlapping windows that I sample for training. That operation is memory-intensive, as it loads all the features into RAM. Your solution would be to push most of those operations into __getitem__ (line 136), but it will require further engineering for a decent loading speed.
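A minimal sketch of that lazy-loading idea (illustrative, not the repo's code; the mmap-based loading and random window sampling are assumptions):

import numpy as np
import torch
from torch.utils.data import Dataset

class LazyClips(Dataset):
    # Sketch: keep only the list of feature file paths in __init__, and
    # load each game's features lazily inside __getitem__ instead of
    # pre-loading every game into RAM.
    def __init__(self, feature_paths, window_size_frame):
        self.feature_paths = feature_paths  # one .npy file per half/game
        self.window_size_frame = window_size_frame

    def __len__(self):
        return len(self.feature_paths)

    def __getitem__(self, index):
        # mmap_mode="r" reads only the slice we index, not the whole file
        feats = np.load(self.feature_paths[index], mmap_mode="r")
        n_windows = feats.shape[0] // self.window_size_frame
        start = np.random.randint(n_windows) * self.window_size_frame
        clip = np.array(feats[start:start + self.window_size_frame])
        return torch.from_numpy(clip)

Combined with num_workers > 0 in the DataLoader, this trades RAM for disk I/O.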

Regarding the peak in evaluation, you can skip it altogether and validate on the loss to prevent overfitting. Here, I was checking overfitting either with the loss or with the Spotting Average-mAP, but using the loss was actually good enough. For testing or inference, you will need a sliding window with a stride of 1 frame, or you can skip a few frames for a lighter inference. Again, it loads all features into RAM for the sake of simplicity, but better engineering could solve your issue.

I hope that will help you!

Cheers,

@Wilann (Author) commented Dec 14, 2021

Hi @SilvioGiancola,

Thank you so much for your response!

I understand what you mean by moving the loading portion into __getitem__ - I'll see what I can do about it.

About this part:

> Regarding the peak in evaluation, you can skip it altogether and validate on the loss to prevent overfitting. Here, I was checking overfitting either with the loss or with the Spotting Average-mAP, but using the loss was actually good enough.

I'm a bit confused when you say "validate on the loss". How do we validate on the loss? My understanding was that validation typically occurs on a portion of the dataset.
Also, you mention checking overfitting with the loss/Average-mAP - could you elaborate a bit more on how you did that?

Thanks so much for your time again - I really appreciate your help!

@SilvioGiancola (Owner):

Hi @Wilann, since the RAM peak appears in evaluation (testSpotting), you can skip that part. testSpotting requires a dense sliding window every second or so, hence it incurs a lot of RAM usage, as the whole video is stored in overlapping windows.

testSpotting is called twice: when "testing" and when "validating". For testing, it is obvious why: you want to evaluate your spotting performance on the test set. You can skip that testing for now. For validating (to avoid overfitting), you can stop the training when the validation loss starts increasing. However, since the loss is a classification loss and not a spotting loss, you are not finding the best model for spotting but for classification. To be closer to the spotting task, you can instead validate with the spotting performance (the best epoch being the one after which the spotting performance on validation drops) rather than the classification loss (the best epoch being the one after which the validation loss increases). Note that this choice does not affect the final performance much, but validating on the loss instead of the spotting Avg-mAP will incur less RAM usage on your end.

That is also the reason why I implemented 2 validation datasets, one for classification and one for spotting:

dataset_train = SoccerNetClips(...)
dataset_valid = SoccerNetClips(...)
dataset_valid_metric = SoccerNetClipsTesting(...)
dataset_test = SoccerNetClipsTesting(...)
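A minimal sketch of that early-stopping logic (train_one_epoch and validation_loss are placeholder functions, not this repo's API):

import torch

best_loss, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(1000):
    train_one_epoch(model, train_loader)             # dataset_train (SoccerNetClips)
    val_loss = validation_loss(model, valid_loader)  # dataset_valid (SoccerNetClips)
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "model.pth.tar")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss stopped improving
# run the RAM-heavy testSpotting once, on the saved best checkpoint only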

@Wilann (Author) commented Dec 19, 2021

Hi @SilvioGiancola,

When you say:

> To be closer to the spotting task, you can instead validate with the spotting performance (the best epoch being the one after which the spotting performance on validation drops)

Is the spotting performance the Average-mAP?

If so, my understanding is that I basically skip using the testing dataset for now, and only use testSpotting on the validation set. This change will save me quite a bit of RAM, but the downside is that I won't have another metric (Average-mAP from the test set) to gauge how well the model is generalizing - is this correct?

Another few questions:

1. Here in SoccerNetClips, what does this do? (I don't quite understand the comment)

2.1. Here again in SoccerNetClips, what is being set to 0 here?

2.2. Here on the next line, why is 1 being added to the label index?

3. When sampling the dataloaders, for dataset_test, I'm getting something like:
Dataloader Length: 10
Batch Length: 3
Features Shape: torch.Size([1, 1866, 8, 2048])
Labels Shape: torch.Size([1, 1867, 2])
Index Shape: torch.Size([1])

Which I interpret as:

Dataloader Length: len(self.listGames)
Batch Length: 3
Features Shape: torch.Size([batch_size, 1866, window_size, 2048])
Labels Shape: torch.Size([batch_size, 1867, num_classes])
Index Shape: torch.Size([batch_size])

I believe 1866 and 1867 are being generated by this line, but what do they mean?

Thank you so much for your help again - really appreciate it.

@SilvioGiancola (Owner):

Hi @Wilann,

Yes, by spotting performance I meant the Average-mAP, and yes, you can skip the test set for the sake of memory, and use either SoccerNetClipsTesting or SoccerNetClips in validation, to check the overfitting on the spotting metric or on the classification metric, respectively.

1. That simply discards the events (annotations) that happen after the end of the game (1st and 2nd half). It is just a sanity check: the video features might not cover the complete video time due to their temporal receptive field and the temporal stride/padding. Here I am discarding the annotations at the end of each half that are not covered by any frame feature, and hence which I can't learn from.
2. label_half1[frame//self.window_size_frame] is a one-hot encoding of the n+1 classes, including a background class with index 0. Line 86 initializes the one-hot encodings for all frames to [1, 0, ..., 0] (background), and on line 119 I change the one-hot encoding to reflect the correct class (no longer BG, but class label+1).
3. feat2clip (line 23) re-arranges the features of a video into clips of a specific length with a specific stride, in other words a sliding window of dimension clip_length extracted every stride indices. It changes the dimensions from [video_length, feature_dimension] to [number_of_clips, clip_length, feature_dimension]. You might want to play with the padding parameter to get the same number of clips as the number of frames in your video/features (if you use a stride of 1), using "zeropad" to pad with zeros or "replicate_last" to replicate the last features and align with the number of frames in the video. See the sketch below.
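Here is a minimal sketch of what feat2clip does (an illustrative re-implementation, not the repo's exact code):

import torch

def feat2clip_sketch(feats, stride, clip_length, padding="replicate_last"):
    # feats: [video_length, feature_dim] -> [number_of_clips, clip_length, feature_dim]
    starts = torch.arange(0, feats.shape[0], stride)             # one clip start every `stride`
    idxs = starts[:, None] + torch.arange(clip_length)[None, :]  # frame indices of each clip
    if padding == "zeropad":
        # append zero frames so out-of-range indices land on zeros
        pad = torch.zeros(clip_length, feats.shape[1], dtype=feats.dtype)
        feats = torch.cat([feats, pad], dim=0)
    else:  # "replicate_last": clamp indices to repeat the last real frame
        idxs = idxs.clamp(max=feats.shape[0] - 1)
    return feats[idxs]

With stride=1 this yields one clip per frame, which matches the [1866, window_size, 2048] feature shape you printed above.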
I hope that will help you.

@Wilann (Author) commented Jan 9, 2022

Hi @SilvioGiancola,

Thank you for the reply again!

Following up on my questions:

  1. I'm not sure what you mean by this:

> the video features might not cover the complete video time due to their temporal receptive field and the temporal stride/padding

If I understand correctly, the extracted frame features may not span the entire video duration? What's the cause of this?

Also, printing something like this: print(label_half1.shape[0], frame//self.window_size_frame) on my dataset for a single video gives me:

285 1
285 3
285 7
285 11
285 16
...
285 262
285 266
285 269
285 278
285 282

I can see that the sanity check works, and I believe label_half1.shape[0] is the number of frame features for this video, but I'm unsure how frame//self.window_size_frame results in the frame feature number.

2. What is a background class, and what is its purpose? I believe the predictions for this background class are discarded in testSpotting lines 251 & 252 - if this is true, then why include a background class at all?

Again, on line 119, why is the background class being set to 0 here? Why not just omit it altogether if it's not being used? (I don't think it's being used, at least.)

3. Actually, I just logged all the changes to idx in feat2clip, and I realize the function is amazing - it really does work, and the I/O shapes are exactly as you say! Regarding the padding, would you recommend either zeropad or replicate_last over the other, or does it depend on the task and dataset?

Thank you for your time again - really appreciate all the help.

@SilvioGiancola (Owner):

1. At L119, I need frame//self.window_size_frame to be lower than the size of the tensor. The mismatch can come from the annotations, or from a feature extraction with a large stride that can't cover the end of the video. For instance, if you have 109 frames and a stride of 10 frames for your encoder, your last 9 frames won't be encoded, so if an action happens in those last 9 frames, you won't have any feature to recognize it. In case that occurs, L113 makes sure it doesn't break by discarding the annotation (sketched after this list).

2. Look at any spotting paper: we only want to localize when an action occurs. In most of the video, no action happens, hence those frames need to be classified as BG. This is not a classification task ("a goal or a foul") but a localization task, so I decided that BG is class 0. See Faster R-CNN for instance: it also classifies "empty" anchors as BG.

3. I don't have an answer; it's a design choice. Check experimentally what works best and stick with it. Google is your friend :)
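A small sketch making points 1 and 2 concrete (the numbers and tensor names are illustrative, loosely mirroring the dataset code):

import torch

# Point 1: 109 frames encoded with a stride of 10 give only 10 feature
# vectors; frames 100-108 are never encoded.
num_frames, stride = 109, 10
num_feats = num_frames // stride            # = 10

# Point 2: one-hot labels over n+1 classes, index 0 = background (BG).
num_classes = 3
labels = torch.zeros(num_feats, num_classes + 1)
labels[:, 0] = 1                            # every position starts as BG

annotation_frame, annotation_class = 42, 1  # a hypothetical annotated action
pos = annotation_frame // stride
if pos < labels.shape[0]:                   # the sanity check from point 1
    labels[pos, 0] = 0                      # no longer background
    labels[pos, annotation_class + 1] = 1   # classes shifted by +1 for the BG slot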

@Wilann (Author) commented Jan 15, 2022

Hi again,

1. I think I understand now. So if we pass frames at fps=2, for example, to the feature extractor, it shouldn't be an issue, since there's no stride - is that correct?

2. Ah, I see. When there's no action, the clip still needs a class the model can predict, so it's just given class 0.

  3. Thank you for the link! I actually remember reading about padding a while ago, but forgot why each type was used. Very helpful :)

4. An additional question: since I'm labeling my dataset on my own, it's very time-consuming and tedious, so I decided to try flipping/mirroring the videos and reusing the same labels, to see if I could improve performance with minimal effort. Initially, including the labels from flipped videos improved performance, but it got to a point where the metric would instantly drop to 0. I found that the more labels from flipped videos I use, the earlier the metric drops. Do you have any idea why this is the case?

Screenshot from 2022-01-15 07-33-10

Thank you for your time again - really appreciate the help!

@SilvioGiancola (Owner):

It can come from the optimization, which diverged for some reason. Reduce the LR and check. Also, flipping the video inverts the past and future, so you might want to use the NetVLAD baseline instead of NetVLAD++/CALF.

@ldfandian:

> I actually managed to install more RAM in my PC, and NetVLAD++ is able to train on SoccerNet now (with all 500 games, using ~55 GB of RAM). [...]

Any idea what the recommended GPU RAM size is for training and inference? @SilvioGiancola

@SilvioGiancola (Owner):

Hi @ldfandian, the message you quote contains your answer: ~55 GB of RAM.

@ldfandian:

> Hi @ldfandian, the message you quote contains your answer: ~55 GB of RAM.

Thanks for the quick reply, @SilvioGiancola.

And, I need to implement a __getitem__ in order to reduce RAM usage. For GPU RAM, it looks to be tunable (by batch_size) to be as small as you like - correct?

@SilvioGiancola (Owner):

The answer is no. Please read the whole thread and you will understand why.
