
Don't Judge by the Look: Towards Motion Coherent Video Representation (ICLR2024)

arXiv | Primary contact: Yitian Zhang

TL;DR

  • Motivation: Existing object recognition training protocols involve multiple data augmentations but neglect Hue Jittering, which leads to appearance variation. However, we find it beneficial in video understanding, since the appearance variation implicitly prioritizes motion information (see the figure above).
  • Challenges: (1) the inefficient implementation of Hue Jittering; (2) the distribution shift caused by hue variation.
  • Solution: We propose Motion Coherent Augmentation (MCA), which consists of (1) SwapMix, which efficiently modifies the appearance of video samples, and (2) Variation Alignment, which resolves the distribution shift caused by SwapMix.
  • Strengths: (1) clear performance gains; (2) generalization across different architectures and datasets; (3) compatibility with other augmentation methods; (4) applicability of Variation Alignment to other augmentation methods for even higher performance.

Datasets

Please follow the instructions of TSM to prepare the Something-Something V1/V2, Kinetics400, and HMDB51 datasets.

Supported Models

MCA is a general data augmentation method and can be easily applied to existing methods for stronger performance with a few lines of code:

import numpy as np
import torch
import torch.nn as nn

######  Hyperparameters  ######

Beta = 1.0        # Beta distribution parameter for sampling the interpolation coefficient
MCA_Prob = 1.0    # probability of applying MCA to a batch
Lambda_AV = 1.0   # weight of the Variation Alignment loss

######  SwapMix  ######

r = np.random.rand(1)
if r < MCA_Prob:
    # sample the interpolation coefficient lambda for each video in the batch
    batch_num = inputs.shape[0]
    lam = torch.from_numpy(np.random.beta(Beta, Beta, batch_num)).view(-1, 1, 1, 1, 1).float().cuda()
    # randomly permute the RGB channel order (resample until it differs from the identity)
    rand_index = torch.randperm(3).cuda()
    while (rand_index - torch.tensor([0, 1, 2]).cuda()).abs().sum() == 0:
        rand_index = torch.randperm(3).cuda()
    # interpolate between the original and channel-permuted videos for an enlarged input space
    inputs_color = lam * inputs + (1 - lam) * inputs[:, rand_index]

######  Variation Alignment  ######

if r < MCA_Prob:
    # construct the training pair: original and appearance-modified videos in one forward pass
    inputs_cat = torch.cat((inputs, inputs_color), 0)
    output = model(inputs_cat)
    # standard classification loss on the original samples
    loss = criterion(output[:batch_num], target)
    # KL term that aligns predictions on the modified samples with those on the original samples
    loss_kl = Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
        nn.LogSoftmax(dim=1)(output[batch_num:]),
        nn.Softmax(dim=1)(output[:batch_num].detach()))
    loss += loss_kl
else:
    output = model(inputs)
    loss = criterion(output, target)
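
Put together, here is a minimal sketch of how the snippets above slot into a per-batch training step. The names (model, criterion, optimizer) and the (B, C, T, H, W) input layout are assumed placeholders, not the repo's exact training script:

import numpy as np
import torch
import torch.nn as nn

def train_step(model, inputs, target, criterion, optimizer,
               Beta=1.0, MCA_Prob=1.0, Lambda_AV=1.0):
    # inputs: (B, C, T, H, W) video batch on GPU; target: class labels
    batch_num = inputs.shape[0]
    r = np.random.rand(1)
    if r < MCA_Prob:
        # SwapMix: interpolate towards a channel-permuted copy of the batch
        lam = torch.from_numpy(np.random.beta(Beta, Beta, batch_num)).view(-1, 1, 1, 1, 1).float().cuda()
        rand_index = torch.randperm(3).cuda()
        while (rand_index - torch.tensor([0, 1, 2]).cuda()).abs().sum() == 0:
            rand_index = torch.randperm(3).cuda()
        inputs_color = lam * inputs + (1 - lam) * inputs[:, rand_index]
        # Variation Alignment: classify the original samples, align predictions on the modified ones
        output = model(torch.cat((inputs, inputs_color), 0))
        loss = criterion(output[:batch_num], target)
        loss = loss + Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
            nn.LogSoftmax(dim=1)(output[batch_num:]),
            nn.Softmax(dim=1)(output[:batch_num].detach()))
    else:
        output = model(inputs)
        loss = criterion(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()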

Note that Variation Alignment can be easily extended to resolve the distribution shift of other augmentation methods by replacing inputs_color with inputs_aug, which are training samples generated by other augmentation operations:

inputs_cat = torch.cat((inputs, inputs_aug), 0)
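
For instance, here is a hedged sketch of pairing Variation Alignment with a generic appearance augmentation (color jittering via torchvision; the transform and its strengths are illustrative choices, not part of MCA):

import torch
import torch.nn as nn
from torchvision import transforms

# illustrative augmentation; assumes unnormalized RGB frames in [0, 1]
color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)

# inputs: (B, C, T, H, W); torchvision transforms expect (..., C, H, W),
# so fold the temporal dimension into the batch dimension first
B, C, T, H, W = inputs.shape
frames = inputs.permute(0, 2, 1, 3, 4).reshape(B * T, C, H, W)
inputs_aug = color_jitter(frames).reshape(B, T, C, H, W).permute(0, 2, 1, 3, 4)

# same Variation Alignment objective as above, with inputs_aug in place of inputs_color
output = model(torch.cat((inputs, inputs_aug), 0))
loss = criterion(output[:B], target)
loss = loss + Lambda_AV * nn.KLDivLoss(reduction='batchmean')(
    nn.LogSoftmax(dim=1)(output[B:]),
    nn.Softmax(dim=1)(output[:B].detach()))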

Currently, MCA supports the following implementations: a 2D network (TSM), a 3D network (SlowFast), and a Transformer network (Uniformer).

Results

  • Validation Across Architectures

MCA clearly outperforms the baseline method across different architectures on the Something-Something V1 dataset.

Here we provide the pretrained models on all these architectures:

Model           Top-1 Acc.   Weight
TSM             45.63%       link
TSM-MCA         47.57%       link
SlowFast        44.12%       link
SlowFast-MCA    45.88%       link
Uniformer       48.48%       link
Uniformer-MCA   50.51%       link
  • Validation Across Datasets

MCA clearly outperforms the baseline method on different datasets.

Here we provide the pretrained models on Something-Something V2:

Model     Top-1 Acc.   Weight
TSM       59.29%       link
TSM-MCA   60.71%       link

and Kinetics400:

Model     Top-1 Acc.   Weight
TSM       70.28%       link
TSM-MCA   71.08%       link
  • Compatibility with other augmentation methods
  • Application of Variation Alignment in other augmentation methods

Get Started

We provide a comprehensive codebase for video recognition that contains the implementations of the 2D network, 3D network, and Transformer network. Please go to the corresponding folders for specific docs.

Acknowledgment

Our codebase is heavily built upon TSM, SlowFast, Uniformer and FFN. We gratefully thank the authors for their wonderful works. The README file format is heavily based on the GitHub repos of my colleagues Huan Wang, Xu Ma, and Yizhou Wang. Great thanks to them! We also greatly thank the anonymous ICLR'24 reviewers for their constructive comments that helped us improve the paper.
