multigrid_training.md

File metadata and controls

44 lines (34 loc) · 2.04 KB

February 2020

tl;dr: An efficient training technique that varies the spatial and temporal resolution of video mini-batches during training.

Overall impression

The paper is from FAIR and well written, as usual. Lots of experiments, and lots of GPUs (128)! They also validated the method on a single GPU, with a 3x speedup.

Recent SOTA video models: I3D, SlowFast, Non-Local

It draws inspiration from FixRes in that it requires a finetuning stage at the end to address the train/test discrepancy.

Key ideas

  • Use variable mini-batch shapes with different spatial-temporal resolutions, varied according to a schedule. Significant speedup across different models, datasets, and training settings (with or without pretraining, 128 vs 1 GPU).

  • Train on a coarse grid, then on a finer grid, then finally finetune on the same grid as the inference setting.

  • Only requires small changes to the dataloader.

  • Base mini-batch shape: B x T x H x W (x 3 channels)

    • Long cycle
      • 8B x T/4 x H/sqrt(2) x W/sqrt(2)
      • 4B x T/2 x H/sqrt(2) x W/sqrt(2)
      • 2B x T/2 x H x W
      • B x T x H x W
    • Short cycle
      • Base shape from long cycle
      • H/sqrt(2) x W/sqrt(2)
      • H/2 x W/2
  • The first L-1 LR stages use multigrid. The last LR stage uses the baseline mini-batch shape.

  • Long cycle and short cycle

    • Combining the two yields the best performance
  • Batch Normalization: keep the standard BN batch size of 8 in the long cycle. Let the BN batch size increase in the short cycle.
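
The long-cycle and short-cycle shapes listed above can be sketched in a few lines. This is only an illustration of the arithmetic, not the paper's code: the function names, the rounding of H/sqrt(2) to the nearest integer, and the short-cycle batch multipliers (1, 2, 4, chosen here to offset the smaller frame area) are assumptions.

```python
import math

SQRT2 = math.sqrt(2)


def long_cycle_shapes(B, T, H, W):
    """Return the four long-cycle shapes as (batch, frames, height, width)."""
    # Rounding to the nearest integer is an assumption of this sketch.
    small_h, small_w = round(H / SQRT2), round(W / SQRT2)
    return [
        (8 * B, T // 4, small_h, small_w),
        (4 * B, T // 2, small_h, small_w),
        (2 * B, T // 2, H, W),
        (B, T, H, W),
    ]


# (batch multiplier, spatial divisor) for each short-cycle step, in the order
# listed in the notes. The multipliers are an assumption, not from the notes.
SHORT_CYCLE = [(1, 1.0), (2, SQRT2), (4, 2.0)]


def short_cycle_shape(long_shape, step):
    """Return the shape used at iteration `step`, given the current long-cycle shape."""
    b, t, h, w = long_shape
    mult, div = SHORT_CYCLE[step % len(SHORT_CYCLE)]
    return (mult * b, t, round(h / div), round(w / div))


def cost(shape):
    """Number of pixels per mini-batch, a rough proxy for compute per iteration."""
    b, t, h, w = shape
    return b * t * h * w


if __name__ == "__main__":
    base = (8, 16, 224, 224)  # illustrative values, not taken from the notes
    for shape in long_cycle_shapes(*base):
        print(shape, cost(shape) / cost(base))
```

Every long-cycle shape holds roughly the same number of pixels per mini-batch as the base shape (for example, 8B x T/4 x H/sqrt(2) x W/sqrt(2) = 8/(4 x 2) x BTHW), so the coarse shapes process up to 8x more clips per iteration at about the same per-iteration cost.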

Technical details

  • Linear scaling rule
  • Cosine learning rate schedule. This seems to yield similar performance to the stagewise schedule.
  • Temporal subsampling: non-uniform stride
  • May become I/O bound
  • Training beyond 1 to 2 epochs hurts performance.
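
A minimal sketch of the first two items, assuming the common forms of both rules: the learning rate is scaled in proportion to the mini-batch size, and the cosine schedule is a half-period cosine decaying to zero. The function names and the numbers in the usage lines are hypothetical; the paper's exact hyperparameters are not reproduced here.

```python
import math


def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch


def cosine_lr(base_lr, step, total_steps):
    """Half-period cosine decay from base_lr (step 0) to 0 (step == total_steps)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


if __name__ == "__main__":
    # Illustrative values only: a base batch of 8 clips enlarged 8x by the long cycle.
    lr = scaled_lr(base_lr=0.1, base_batch=8, batch=64)
    print(lr, cosine_lr(lr, step=50, total_steps=100))
```

With variable mini-batch shapes, the linear scaling rule is what keeps the effective step size consistent when the batch size jumps between B and 8B.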

Notes

  • Can we apply this to images?
  • Temporal subsampling in the long cycle seems to hurt performance. Can we just downsample the spatial resolution? The short cycle does not downsample time and leads to better performance. Maybe augmenting/subsampling the time dimension alters the meaning of the video.