
Don't Judge by the Look: Towards Motion Coherent Video Representation (ICLR2024)

arXiv | Primary contact: Yitian Zhang

TL;DR

  • Motivation: Existing object recognition training protocols involve multiple data augmentations but neglect Hue Jittering, which leads to appearance variation. However, we find it beneficial in video understanding, since the appearance variation implicitly prioritizes motion information (see the figure above).
  • Challenges: (1) the inefficient implementation of Hue Jittering; (2) the distribution shift caused by hue variation.
  • Solution: We propose Motion Coherent Augmentation (MCA), which consists of (1) SwapMix, which efficiently modifies the appearance of video samples, and (2) Variation Alignment, which resolves the distribution shift caused by SwapMix.
  • Strengths: (1) clear performance gains; (2) generalization across different architectures and datasets; (3) compatibility with other augmentation methods; (4) applicability of Variation Alignment to other augmentation methods for even higher performance.

Datasets

Please follow the instructions of TSM to prepare the Something-Something V1/V2, Kinetics400, and HMDB51 datasets.

Supported Models

MCA is a general data augmentation method and can be easily applied to existing methods for stronger performance with a few lines of code:

import numpy as np
import torch
import torch.nn as nn

######  Hyperparameters  ######

Beta = 1.0        # Beta distribution parameter for sampling the interpolation coefficient
MCA_Prob = 1.0    # probability of applying MCA to a batch
Lambda_AV = 1.0   # weight of the Variation Alignment loss

######  SwapMix  ######

r = np.random.rand(1)
if r < MCA_Prob:
    # sample the interpolation coefficient lambda for each video in the batch
    batch_num = inputs.shape[0]
    lam = torch.from_numpy(np.random.beta(Beta, Beta, batch_num)).view(-1, 1, 1, 1, 1).float().cuda()
    # randomly permute the RGB channel order (resample until it differs from the identity)
    rand_index = torch.randperm(3).cuda()
    while (rand_index - torch.tensor([0, 1, 2]).cuda()).abs().sum() == 0:
        rand_index = torch.randperm(3).cuda()
    # interpolate between the original and channel-permuted videos for an enlarged input space
    inputs_color = lam * inputs + (1 - lam) * inputs[:, rand_index]

######  Variation Alignment  ######

if r < MCA_Prob:
    # construct the training pair: original and appearance-modified videos in one forward pass
    inputs_cat = torch.cat((inputs, inputs_color), 0)
    output = model(inputs_cat)
    # standard classification loss on the original samples
    loss = criterion(output[:batch_num], target)
    # KL term that aligns predictions on the modified samples with those on the original samples
    loss_kl = Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
        nn.LogSoftmax(dim=1)(output[batch_num:]),
        nn.Softmax(dim=1)(output[:batch_num].detach()))
    loss += loss_kl
else:
    output = model(inputs)
    loss = criterion(output, target)
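
Put together, here is a minimal sketch of how the snippets above slot into a per-batch training step. The names (model, criterion, optimizer) and the (B, C, T, H, W) input layout are assumed placeholders, not the repo's exact training script:

import numpy as np
import torch
import torch.nn as nn

def train_step(model, inputs, target, criterion, optimizer,
               Beta=1.0, MCA_Prob=1.0, Lambda_AV=1.0):
    # inputs: (B, C, T, H, W) video batch on GPU; target: class labels
    batch_num = inputs.shape[0]
    r = np.random.rand(1)
    if r < MCA_Prob:
        # SwapMix: interpolate towards a channel-permuted copy of the batch
        lam = torch.from_numpy(np.random.beta(Beta, Beta, batch_num)).view(-1, 1, 1, 1, 1).float().cuda()
        rand_index = torch.randperm(3).cuda()
        while (rand_index - torch.tensor([0, 1, 2]).cuda()).abs().sum() == 0:
            rand_index = torch.randperm(3).cuda()
        inputs_color = lam * inputs + (1 - lam) * inputs[:, rand_index]
        # Variation Alignment: classify the original samples, align predictions on the modified ones
        output = model(torch.cat((inputs, inputs_color), 0))
        loss = criterion(output[:batch_num], target)
        loss = loss + Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
            nn.LogSoftmax(dim=1)(output[batch_num:]),
            nn.Softmax(dim=1)(output[:batch_num].detach()))
    else:
        output = model(inputs)
        loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()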

Note that Variation Alignment can be easily extended to resolve the distribution shift of other augmentation methods by replacing inputs_color with inputs_aug, which are training samples generated by other augmentation operations:

inputs_cat = torch.cat((inputs, inputs_aug), 0)
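
For instance, here is a hedged sketch of pairing Variation Alignment with a generic appearance augmentation (color jittering via torchvision; the transform and its strengths are illustrative choices, not part of MCA):

import torch
import torch.nn as nn
from torchvision import transforms

# illustrative augmentation; assumes unnormalized RGB frames in [0, 1]
color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)

# inputs: (B, C, T, H, W); torchvision transforms expect (..., C, H, W),
# so fold the temporal dimension into the batch dimension first
B, C, T, H, W = inputs.shape
frames = inputs.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
inputs_aug = color_jitter(frames).reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)

# same Variation Alignment objective as above, with inputs_aug in place of inputs_color
output = model(torch.cat((inputs, inputs_aug), 0))
loss = criterion(output[:B], target)
loss = loss + Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
    nn.LogSoftmax(dim=1)(output[B:]),
    nn.Softmax(dim=1)(output[:B].detach()))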

Currently, MCA supports the following implementations: a 2D network (TSM), a 3D network (SlowFast), and a Transformer network (Uniformer).

Results

  • Validation Across Architectures

MCA clearly outperforms the baseline method across different architectures on the Something-Something V1 dataset.

Here we provide the pretrained models on all these architectures:

Model           Top-1 Acc.   Weight
TSM             45.63%       link
TSM-MCA         47.57%       link
SlowFast        44.12%       link
SlowFast-MCA    45.88%       link
Uniformer       48.48%       link
Uniformer-MCA   50.51%       link
  • Validation Across Datasets

MCA clearly outperforms the baseline method on different datasets.

Here we provide the pretrained models on Something-Something V2:

Model     Top-1 Acc.   Weight
TSM       59.29%       link
TSM-MCA   60.71%       link

and Kinetics400:

Model     Top-1 Acc.   Weight
TSM       70.28%       link
TSM-MCA   71.08%       link
  • Compatibility with other augmentation methods
  • Application of Variation Alignment in other augmentation methods

Get Started

We provide a comprehensive codebase for video recognition that contains the implementations of the 2D network, 3D network, and Transformer network. Please go to the corresponding folders for specific docs.

Acknowledgment

Our codebase is heavily built upon TSM, SlowFast, Uniformer and FFN. We gratefully thank the authors for their wonderful works. The README file format is heavily based on the GitHub repos of my colleagues Huan Wang, Xu Ma, and Yizhou Wang. Great thanks to them! We also greatly thank the anonymous ICLR'24 reviewers for their constructive comments that helped us improve the paper.
