# Next-Frame Prediction

Next-frame prediction is a promising field of research in computer vision: given a sequence of past frames, a model predicts plausible future frames. It has broad applications in robot decision-making and autonomous driving.

## Model

Convolutional LSTM architectures bring together time-series processing and computer vision by introducing a convolutional recurrent cell in an LSTM layer. Using a Convolutional LSTM model, we explored next-frame prediction: the task of predicting what the next video frame will look like from a series of previous frames.
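For concreteness, here is a minimal ConvLSTM cell sketch in PyTorch. It is illustrative rather than the repository's exact implementation; the names (`ConvLSTMCell`, `hidden_channels`) are placeholders:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: the gate computations of a standard
    LSTM, with the matrix multiplications replaced by 2-D convolutions so
    spatial structure is preserved."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        self.hidden_channels = hidden_channels
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=kernel_size // 2)

    def init_state(self, batch, height, width, device=None):
        shape = (batch, self.hidden_channels, height, width)
        return (torch.zeros(shape, device=device),
                torch.zeros(shape, device=device))

    def forward(self, x, state):
        h, c = state
        # Concatenate input and hidden state along channels, then split gates.
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)
        h = o * torch.tanh(c)
        return h, c
```

Unrolling such a cell over the input frames, with the hidden state mapped back to pixels by a final convolution, yields a next-frame predictor.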

### 1.1. Fast Fourier Convolution

Vanilla convolutions in modern deep networks operate locally and at a fixed scale (e.g., the widely adopted 3 × 3 kernels in image-oriented tasks), which makes it inefficient to connect two distant locations in the network. Chi et al. (2020) propose a convolutional operator dubbed Fast Fourier Convolution (FFC), whose main hallmarks are a non-local receptive field and cross-scale fusion within a single convolutional unit. By the spectral convolution theorem of Fourier theory, a point-wise update in the spectral domain globally affects all input features involved in the Fourier transform, which motivates architectural designs with non-local receptive fields. The idea is to convert the spatial features to spectral features, apply some operations, and convert back to spatial features; operations in the spectral domain extend the receptive field of the convolution to the full resolution of the input feature map (Chi et al., 2020).

In this project, to improve the convolution layers, we replaced them with Fast Fourier Convolutions (FFC). Furthermore, each FFC layer is followed by a self-attention layer.
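The spectral path described above can be sketched as follows. This is a simplified illustration of the global branch only (FFT, point-wise convolution in the frequency domain, inverse FFT), assuming PyTorch's `torch.fft` API; the full FFC of Chi et al. (2020) also has a local branch and cross-branch fusion, which are omitted here:

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Sketch of the global (spectral) branch of an FFC layer. Because every
    spectral point depends on all spatial positions, the effective receptive
    field covers the whole feature map."""

    def __init__(self, channels):
        super().__init__()
        # Real and imaginary parts are stacked along the channel axis,
        # so the point-wise conv sees 2 * channels inputs.
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Real FFT over the two spatial dimensions.
        ffted = torch.fft.rfft2(x, norm="ortho")            # complex: (b, c, h, w//2 + 1)
        ffted = torch.cat([ffted.real, ffted.imag], dim=1)  # (b, 2c, h, w//2 + 1)
        ffted = torch.relu(self.conv(ffted))                # point-wise spectral update
        real, imag = torch.chunk(ffted, 2, dim=1)
        # Back to the spatial domain at the original resolution.
        return torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
```

In the FfcLSTM variant described above, a layer like this would stand in for the ordinary convolutions inside the recurrent cell, with a self-attention layer applied after it.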
## Dataset

The Moving MNIST dataset contains 10,000 video sequences, each consisting of 20 frames. In each video sequence, two digits move independently around the frame, which has a spatial resolution of 64×64 pixels. The digits frequently intersect with each other and bounce off the edges of the frame.
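As a loading sketch, the widely used single-file NumPy release of Moving MNIST has shape `(20, 10000, 64, 64)` (20 frames per sequence, 10,000 sequences); the 10-input/10-target split below is the conventional next-frame setup, not necessarily the exact split used in this repository:

```python
import numpy as np

# Commonly mirrored at:
# http://www.cs.toronto.edu/~nitish/unsupervised_video/mnist_test_seq.npy
data = np.load("mnist_test_seq.npy")   # (20, 10000, 64, 64)
data = data.transpose(1, 0, 2, 3)      # -> (sequences, frames, H, W)
data = data[:, :, np.newaxis] / 255.0  # add channel dim, scale to [0, 1]

# Typical setup: first 10 frames are input, the next 10 are targets.
x, y = data[:, :10], data[:, 10:]
print(x.shape, y.shape)  # (10000, 10, 1, 64, 64) (10000, 10, 1, 64, 64)
```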

## Results

The best result was achieved by ConvLSTM with a self-attention layer. The second-best result came from the base model, which uses plain ConvLSTM cells. The worst result came from replacing the ordinary convolution layers with FFC, probably because of the complexity of the FfcLSTM model: as the loss chart shows, its validation loss is noticeably worse than its training loss, which suggests the model overfits the training set.

## About

Predicts the next frame of a sequence of video frames (GIFs).
